Amsterdam is the capital of the Netherlands, known for its rich history and a legacy dating back to the 17th-century Golden Age. The city has a population of 821,752, and as that population grows, so does the number of factors that can affect rent, such as the area of the residence, the number and size of bedrooms, bathroom facilities, and internet access.
Companies such as Funda and Kamernet address the housing search, but they display only the houses currently available, and those houses do not always have the desired features. A user therefore cannot learn the estimated price of a house that does not yet exist or is not yet listed. Estimating a house price manually would also take considerable time, possibly hours or days. Using artificial intelligence to predict the rental price of a house in Amsterdam from its important features is therefore valuable for homeowners and real estate agencies: whatever the features, housing type, or location, the agency can provide the information easily.
In this project we look at a dataset of rental prices in Amsterdam and examine how rent depends on important factors such as area, bedrooms, toilet, and internet access.
In the current situation, an individual could use multiple websites, each with a different selection of available housing, to find a potential future home. The number of these marketplaces is increasing rapidly, and they do not all work the same way or offer the same features. This makes it hard for someone looking for a house to calculate what they will be paying in rent, especially since homes that are not available right now might become available later without being listed on any website in the meantime. That makes it hard to predict what one could be paying for future accommodation, and hence hard to budget properly, both for new tenants and for existing tenants who might want to move into a bigger house.
These factors, combined with the relative lack of knowledge among international newcomers about the rules and rights of renting in the Netherlands, point to another possible gap in the market (I Amsterdam, 2020). People in this category can easily be preyed upon and exploited by potential landlords through, for example, the state of delivery, extra fees, security deposits, and other hidden costs which may or may not be legal.
The goal of this project is to forecast housing rent prices for potential local tenants as well as international newcomers, with respect to their budget, financial plan, and preferred features.
Can house price prediction help newly arrived students identify a baseline rent price for negotiation and spot scams?
• What would be the estimated rent price for a tenant in Amsterdam, given a property with specific coordinates, floor area in square metres, and tenancy agreement (i.e., shared or individual)?
• What would be the estimated margin of the price based on the prediction?
Further EDA and analysis follow below.
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn.model_selection import GridSearchCV
import seaborn as sns
import missingno as msno
pd.set_option('display.max_columns', None)
print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)
print('scikit-learn version:', sk.__version__)
print('matplotlib version:', matplotlib.__version__)
%matplotlib inline
numpy version: 1.21.5
pandas version: 1.4.1
scikit-learn version: 1.0.2
matplotlib version: 3.5.1
# we start loading the data and checking the first 2 rows
kamer=pd.read_json(r'C:\Users\ramya\Documents\remy\challenge\properties.json', lines=True)
kamer.head(2)
| _id | externalId | areaRaw | areaSqm | city | coverImageUrl | crawlStatus | crawledAt | datesPublished | firstSeenAt | furnish | lastSeenAt | latitude | longitude | postalCode | postedAgo | propertyType | rawAvailability | rent | rentDetail | rentRaw | source | title | url | additionalCosts | additionalCostsRaw | deposit | depositRaw | descriptionNonTranslated | descriptionNonTranslatedRaw | descriptionTranslated | descriptionTranslatedRaw | detailsCrawledAt | energyLabel | gender | internet | isRoomActive | kitchen | living | matchAge | matchAgeBackup | matchCapacity | matchGender | matchGenderBackup | matchLanguages | matchStatus | matchStatusBackup | pageDescription | pageTitle | pets | registrationCost | registrationCostRaw | roommates | shower | smokingInside | toilet | userDisplayName | userId | userLastLoggedOn | userMemberSince | userPhotoUrl | additionalCostsDescription | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'$oid': '5d2b113a43cbfd7c77a998f4'} | room-1686123 | 14 m2 | 14 | Rotterdam | https://resources.kamernet.nl/image/913b4b03-5... | done | {'$date': '2019-07-26T22:18:23.018+0000'} | [{'$date': '2019-07-14T11:25:46.511+0000'}, {'... | {'$date': '2019-07-14T11:25:46.511+0000'} | Unfurnished | {'$date': '2019-07-26T22:18:23.142+0000'} | 51.896601 | 4.514993 | 3074HN | 4w | Room | 26-06-'19 - Indefinite period | 500 | € 500,- | kamernet | West-Varkenoordseweg | https://kamernet.nl/en/for-rent/room-rotterdam... | 50.0 | \n € 50\n ... | 500.0 | \n € 500\n ... | Nice room for rent, accros the Feyenoord stadi... | \nNice room for rent, accros the Feyenoord sta... | Nice room for rent, accros the Feyenoord stadi... | \nNice room for rent, accros the Feyenoord sta... | {'$date': '2019-07-22T07:10:41.849+0000'} | Unknown | Mixed | Yes | true | Shared | None | 16 years -\n 99 years | 16 years -\n 99 years | 1 person | Not important | Not important | Not important | Not important | Not important | Room for rent in Rotterdam, West-Varkenoordse... | Room for rent in Rotterdam €500 | Kamernet | No | 0 | \n € 0\n ... | 5 | Shared | No | Shared | Huize west | 4680711.0 | 21-07-2019 | 26-06-2019 | https://resources.kamernet.nl/Content/images/s... | NaN | |
| 1 | {'$oid': '5d2b113a43cbfd7c77a9991a'} | studio-1691193 | 30 m2 | 30 | Amsterdam | https://resources.kamernet.nl/image/5e11d6b5-8... | done | {'$date': '2019-08-10T22:28:46.099+0000'} | [{'$date': '2019-07-14T11:25:46.677+0000'}, {'... | {'$date': '2019-07-14T11:25:46.677+0000'} | Furnished | {'$date': '2019-08-10T22:28:46.229+0000'} | 52.370200 | 4.920721 | 1018AS | 4w | Studio | 15-08-'19 - Indefinite period | 950 | Utilities incl. | € 950,- Utilities incl. | kamernet | Parelstraat | https://kamernet.nl/en/for-rent/studio-amsterd... | 0.0 | \n € 0\n ... | 895.0 | \n € 895\n ... | Efficiently furnished, with a large balcony, a... | \nEfficiently furnished, with a large balcony,... | Efficiently furnished, with a large balcony, a... | \nEfficiently furnished, with a large balcony,... | {'$date': '2019-07-22T06:29:33.112+0000'} | Unknown | Unknown | Yes | true | Own | Own | 18 years -\n 99 years | 18 years -\n 99 years | 1 person | Not important | Not important | Not important | Working student, Working | Working student, Working | Studio for rent in Amsterdam, Parelstraat, fo... | Studio for rent in Amsterdam €950 | Kamernet | No | 0 | \n € 0\n ... | None | Own | No | Own | Cor | 1865530.0 | 20-07-2019 | 05-01-2012 | https://resources.kamernet.nl/Content/images/p... | NaN |
# load the second dataset and display it
huurwon=pd.read_json(r'C:\Users\ramya\Documents\remy\properties.json', orient='records')
huurwon
| url | title | location | rent | area | type | construction_year | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.huurwoningen.nl/huren/eygelshoven/... | Appartement Kerkraderstraat in Eygelshoven | 6471 BJ (Vink) | € 750 per maand | 80 m² | Appartement | 2003.0 | 3.0 | 2.0 |
| 1 | https://www.huurwoningen.nl/huren/gorinchem/10... | Appartement Koningin Wilhelminalaan in Gorinchem | 4205 ET (Haarwijk West) | € 665 per maand | 45 m² | Appartement | NaN | 2.0 | 1.0 |
| 2 | https://www.huurwoningen.nl/huren/enschede/120... | Appartement Korte Haaksbergerstraat in Enschede | 7511 JS (City) | € 1.090 per maand | 75 m² | Appartement | 2009.0 | 3.0 | 2.0 |
| 3 | https://www.huurwoningen.nl/huren/apeldoorn/12... | Kamer Hamelweg in Apeldoorn | 7311 EA (Brinkhorst) | € 270 per maand | 15 m² | Kamer | NaN | 1.0 | NaN |
| 4 | https://www.huurwoningen.nl/huren/rotterdam/12... | Appartement Witte de Withstraat in Rotterdam | 3012 BT (Cool) | € 1.295 per maand | 60 m² | Appartement | 1890.0 | 2.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1993 | https://www.huurwoningen.nl/huren/rotterdam/12... | Appartement Stadhoudersplein in Rotterdam | 3038 EA (Bergpolder) | € 1.250 per maand | 45 m² | Appartement | 1957.0 | 2.0 | 1.0 |
| 1994 | https://www.huurwoningen.nl/huren/rotterdam/12... | Studio Mijnsherenlaan in Rotterdam | 3081 CM (Tarwewijk) | € 750 per maand | 22 m² | Studio | 1938.0 | 2.0 | 1.0 |
| 1995 | https://www.huurwoningen.nl/huren/leeuwarden/1... | Appartement Eewal in Leeuwarden | 8911 GV (Grote Kerkbuurt) | € 1.200 per maand | 90 m² | Appartement | NaN | 3.0 | 2.0 |
| 1996 | https://www.huurwoningen.nl/huren/middelburg-z... | Appartement Gerbrandijlaan in Middelburg | 4333 BM (Klarenbeek I) | € 645 per maand | 52 m² | Appartement | NaN | 2.0 | 1.0 |
| 1997 | https://www.huurwoningen.nl/huren/utrecht/1279... | Appartement Ondiep in Utrecht | 3552 EE (Ondiep) | € 1.395 per maand | 73 m² | Appartement | 1932.0 | 4.0 | 3.0 |
1998 rows × 9 columns
Before we start, let's study the datasets. They were crawled from the websites Huurwoningen and Kamernet. First we clean the Kamernet data, then the Huurwoningen data, and finally we merge the two.
The Kamernet dataset has 62 columns and 46,722 rows. Most of the columns contain data we do not need, such as links, IDs, or match gender. To stay within scope I selected the following columns:
areaSqm: area of the house in square metres
city: city
longitude: longitude of the house
latitude: latitude of the house
toilet: whether the toilet is own or shared
shower: whether the shower is own or shared
kitchen: whether the kitchen is own or shared
living: whether the living space is own or shared
propertyType: whether the property is an apartment or a room
rent: rent price in euros
postalCode: zip code of the house
And the data from Huurwoningen has 9 columns and 1,998 rows:
url
title: contains the city name
location: zip code of the house
rent: rent price in euros
area: area of the house in square metres
type: whether the property is an apartment or a room
construction_year: year the house was built
rooms: number of rooms in the house
bedrooms: number of bedrooms
Before we start the EDA we tidy the data and make sure it is clean, so that reliable and accurate information can be extracted.
In 2.1 we clean the Kamernet dataset.
# insert a space in the zip code column
kamer['postalCode'] = kamer['postalCode'].apply(lambda x:x[:4]+' '+x[-2:])
kamer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46722 entries, 0 to 46721
Data columns (total 62 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   _id                          46722 non-null  object
 1   externalId                   46722 non-null  object
 2   areaRaw                      46722 non-null  object
 3   areaSqm                      46722 non-null  int64
 4   city                         46722 non-null  object
 5   coverImageUrl                46722 non-null  object
 6   crawlStatus                  46722 non-null  object
 7   crawledAt                    46722 non-null  object
 8   datesPublished               46722 non-null  object
 9   firstSeenAt                  46722 non-null  object
 10  furnish                      46722 non-null  object
 11  lastSeenAt                   46722 non-null  object
 12  latitude                     46722 non-null  float64
 13  longitude                    46722 non-null  float64
 14  postalCode                   46722 non-null  object
 15  postedAgo                    46722 non-null  object
 16  propertyType                 46722 non-null  object
 17  rawAvailability              46722 non-null  object
 18  rent                         46722 non-null  int64
 19  rentDetail                   46722 non-null  object
 20  rentRaw                      46722 non-null  object
 21  source                       46722 non-null  object
 22  title                        46722 non-null  object
 23  url                          46722 non-null  object
 24  additionalCosts              14301 non-null  float64
 25  additionalCostsRaw           46622 non-null  object
 26  deposit                      27704 non-null  float64
 27  depositRaw                   46622 non-null  object
 28  descriptionNonTranslated     46622 non-null  object
 29  descriptionNonTranslatedRaw  46622 non-null  object
 30  descriptionTranslated        46622 non-null  object
 31  descriptionTranslatedRaw     46622 non-null  object
 32  detailsCrawledAt             46722 non-null  object
 33  energyLabel                  46622 non-null  object
 34  gender                       45810 non-null  object
 35  internet                     46622 non-null  object
 36  isRoomActive                 46622 non-null  object
 37  kitchen                      46622 non-null  object
 38  living                       46622 non-null  object
 39  matchAge                     46622 non-null  object
 40  matchAgeBackup               46622 non-null  object
 41  matchCapacity                46622 non-null  object
 42  matchGender                  46622 non-null  object
 43  matchGenderBackup            46622 non-null  object
 44  matchLanguages               46622 non-null  object
 45  matchStatus                  46622 non-null  object
 46  matchStatusBackup            46622 non-null  object
 47  pageDescription              46622 non-null  object
 48  pageTitle                    46622 non-null  object
 49  pets                         46622 non-null  object
 50  registrationCost             4688 non-null   object
 51  registrationCostRaw          46622 non-null  object
 52  roommates                    45810 non-null  object
 53  shower                       46622 non-null  object
 54  smokingInside                46622 non-null  object
 55  toilet                       46622 non-null  object
 56  userDisplayName              46622 non-null  object
 57  userId                       46622 non-null  float64
 58  userLastLoggedOn             46622 non-null  object
 59  userMemberSince              46622 non-null  object
 60  userPhotoUrl                 46622 non-null  object
 61  additionalCostsDescription   20546 non-null  object
dtypes: float64(5), int64(2), object(55)
memory usage: 22.1+ MB
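The slicing above assumes every postal code is stored as exactly six characters ('1234AB'). A more defensive sketch (hypothetical, not part of the original run) uses a regex so that malformed values are left untouched:
# Hypothetical, regex-based variant of the postal code formatting:
# only values matching the Dutch '1234AB' pattern get a space inserted.
kamer['postalCode'] = kamer['postalCode'].str.replace(
    r'^(\d{4})\s*([A-Za-z]{2})$', r'\1 \2', regex=True)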
Now we select the columns needed for our model into a new data frame. After that we check for null values and whether each row was fully crawled.
#creating a new dataframe to not mess with the original data
house=kamer[['areaSqm','city','longitude','latitude','toilet','shower','kitchen'
,'living','propertyType','rent','crawlStatus','postalCode']]
# unique values in the column
house['crawlStatus'].unique()
array(['done', 'unavailable'], dtype=object)
#the sum of null values in the columns
house.isnull().sum()
areaSqm           0
city              0
longitude         0
latitude          0
toilet          100
shower          100
kitchen         100
living          100
propertyType      0
rent              0
crawlStatus       0
postalCode        0
dtype: int64
# display the NA rows to study why the values could be missing
house[house.isnull().any(axis=1)]
| areaSqm | city | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | crawlStatus | postalCode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 77 | 13 | Den Haag | 4.300775 | 52.060570 | NaN | NaN | NaN | NaN | Room | 500 | unavailable | 2525 ZA |
| 80 | 14 | Den Haag | 4.300775 | 52.060570 | NaN | NaN | NaN | NaN | Room | 500 | unavailable | 2525 ZA |
| 91 | 20 | Ubbena | 6.589102 | 53.054733 | NaN | NaN | NaN | NaN | Room | 20 | unavailable | 9492 TG |
| 116 | 18 | Groningen | 6.572627 | 53.232240 | NaN | NaN | NaN | NaN | Room | 355 | unavailable | 9715 AN |
| 231 | 10 | Haarlem | 4.659657 | 52.365043 | NaN | NaN | NaN | NaN | Room | 350 | unavailable | 2035 VE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43453 | 9 | Utrecht | 5.124416 | 52.063441 | NaN | NaN | NaN | NaN | Room | 550 | unavailable | 3525 CN |
| 44914 | 26 | Rotterdam | 4.492083 | 51.894133 | NaN | NaN | NaN | NaN | Room | 585 | unavailable | 3081 CD |
| 45434 | 15 | Amsterdam | 4.876734 | 52.356464 | NaN | NaN | NaN | NaN | Room | 650 | unavailable | 1071 LE |
| 45492 | 120 | Nieuwerkerk a/d IJssel | 4.611392 | 51.959334 | NaN | NaN | NaN | NaN | Apartment | 1950 | unavailable | 2911 HA |
| 45520 | 16 | Enschede | 6.870352 | 52.209836 | NaN | NaN | NaN | NaN | Room | 275 | unavailable | 7545 XA |
100 rows × 12 columns
As can be seen, this dataset contains quite a few missing values. How do we approach this? Looking closer, some fields are empty or marked 'Unknown', and the crawlStatus column distinguishes crawled ('done') from uncrawled ('unavailable') rows. So let us drop the 'unavailable' rows, which the crawler could not retrieve, and convert the empty and 'Unknown' values to NaN.
# replace empty/'Unknown' values with NaN and drop the rows we don't need
house = house.replace({'': np.NaN, 'Unknown': np.NaN})
house = house.drop(house[house.crawlStatus == 'unavailable'].index)
house
| areaSqm | city | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | crawlStatus | postalCode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14 | Rotterdam | 4.514993 | 51.896601 | Shared | Shared | Shared | None | Room | 500 | done | 3074 HN |
| 1 | 30 | Amsterdam | 4.920721 | 52.370200 | Own | Own | Own | Own | Studio | 950 | done | 1018 AS |
| 2 | 11 | Amsterdam | 4.854786 | 52.350880 | Shared | Shared | Shared | Shared | Room | 1000 | done | 1075 SB |
| 3 | 16 | Assen | 6.561012 | 53.013494 | Shared | Shared | Shared | None | Room | 290 | done | 9407 BG |
| 4 | 22 | Rotterdam | 4.479732 | 51.932871 | Shared | Shared | Own | Own | Room | 475 | done | 3035 AK |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46717 | 28 | Rotterdam | 4.507187 | 51.928624 | Shared | Shared | Shared | Shared | Room | 800 | done | 3061 AG |
| 46718 | 16 | Harmelen | 4.959942 | 52.086568 | Shared | Shared | Shared | Shared | Room | 400 | done | 3481 VE |
| 46719 | 30 | Rotterdam | 4.507187 | 51.928624 | Shared | Shared | Shared | Shared | Room | 950 | done | 3061 AG |
| 46720 | 35 | Rotterdam | 4.507187 | 51.928624 | Shared | Shared | Shared | Shared | Room | 1050 | done | 3061 AG |
| 46721 | 25 | Rotterdam | 4.463528 | 51.921077 | Shared | Shared | Shared | Own | Studio | 738 | done | 3014 CD |
46622 rows × 12 columns
#checking for null values
house.isnull().sum()
areaSqm            0
city               0
longitude          0
latitude           0
toilet          7660
shower          7631
kitchen         7628
living          8551
propertyType       0
rent               0
crawlStatus        0
postalCode         0
dtype: int64
Looking at the data now, the number of NaN values has increased, because we converted all empty and 'Unknown' strings to NaN. How do we deal with this? Let's filter the city for Amsterdam only and check again. We filter on strings containing the word "Amsterdam" rather than on exact equality so that we cover the whole Amsterdam area: some places have different names but still belong to Amsterdam, for example "Amsterdam Zuid".
#filter for amsterdam
house=house[house['city'].str.contains("Amsterdam")]
#check for null values
house.isnull().sum()
areaSqm            0
city               0
longitude          0
latitude           0
toilet           998
shower           982
kitchen          980
living          1124
propertyType       0
rent               0
crawlStatus        0
postalCode         0
dtype: int64
As can be observed, the number of NaN values has decreased, so it is time to impute. These values appear to be missing at random: the missingness can be related to other observed variables, since the columns are not independent of one another. I chose to impute with the mode, the most frequent value of each column, because the values are missing simply because the crawler did not retrieve all of the information.
# using the MSNO library we could plot a matrix representing the missing data
msno.matrix(house)
<AxesSubplot:>
This is a different way of visualising the missing data in our dataset. All the affected columns show a similar pattern, sharing roughly the same distribution of missing values. The spark line on the very right shows two numbers, 8 and 12: the least complete rows have values in 8 of the 12 columns, while the most complete rows have all 12 filled.
# the heatmap shows the correlation between missingness in the columns; they all appear highly correlated
msno.heatmap(house)
<AxesSubplot:>
# check the unique values in the column and show them on a bar plot
print(house['toilet'].unique())
house['toilet'].value_counts().plot.bar()
['Own' 'Shared' nan 'None']
<AxesSubplot:>
# check the unique values in the column and show them on a bar plot
print(house['shower'].unique())
house['shower'].value_counts().plot.bar()
['Own' 'Shared' nan 'None']
<AxesSubplot:>
# check the unique values in the column and show them on a bar plot
print(house['living'].unique())
house['living'].value_counts().plot.bar()
['Own' 'Shared' nan 'None']
<AxesSubplot:>
# check the unique values in the column and show them on a bar plot
print(house['kitchen'].unique())
house['kitchen'].value_counts().plot.bar()
['Own' 'Shared' nan 'None']
<AxesSubplot:>
# work on the data under a new name (note: this is a reference to house, not an independent copy)
data=house
The reason we impute with the mode: after some research it turned out the data was not completely crawled, because the Kamernet site serves this information through JSON in a way that hindered the crawler. As a result there are houses with a shared toilet, shared kitchen and shared shower where living is nevertheless missing as NaN. Based on that, it was decided to impute with the mode. Moreover, since the data is categorical and missing at random, the mode is the most appropriate approach; the alternatives require numerical variables. The limitation of this decision is that the results might be biased.
# check the mode and fill the NAs with it
print(data['toilet'].mode())
data['toilet'].fillna('Shared',inplace=True)
data['toilet'].isnull().sum()
0    Shared
Name: toilet, dtype: object
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\108842338.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['toilet'].fillna('Shared',inplace=True)
0
# check the mode and fill the NAs with it
print(data['shower'].mode())
data['shower'].fillna('Shared',inplace=True)
data['shower'].isnull().sum()
0    Shared
Name: shower, dtype: object
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\933807525.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['shower'].fillna('Shared',inplace=True)
0
# check the mode and fill the NAs with it
print(data['living'].mode())
data['living'].fillna('Shared',inplace=True)
data['living'].isnull().sum()
0    Shared
Name: living, dtype: object
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\1770133214.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['living'].fillna('Shared',inplace=True)
0
# check the mode and fill the NAs with it
print(data['kitchen'].mode())
data['kitchen'].fillna('Shared',inplace=True)
data['kitchen'].isnull().sum()
0    Shared
Name: kitchen, dtype: object
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\2332728254.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['kitchen'].fillna('Shared',inplace=True)
0
data.isnull().sum()
areaSqm         0
city            0
longitude       0
latitude        0
toilet          0
shower          0
kitchen         0
living          0
propertyType    0
rent            0
crawlStatus     0
postalCode      0
dtype: int64
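For reference, the four imputation cells above can be collapsed into a single loop; a minimal sketch with the same effect, where assigning through .loc is the form the SettingWithCopyWarning itself recommends:
# Sketch: impute each facility column with its own mode in one pass.
for col in ['toilet', 'shower', 'kitchen', 'living']:
    data.loc[:, col] = data[col].fillna(data[col].mode()[0])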
Here we clean the dataset from Huurwoningen.
# drop the columns we no longer need
huurwon=huurwon.drop('url',axis=1)
huurwon=huurwon.drop('construction_year',axis=1)
huurwon
| title | location | rent | area | type | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|
| 0 | Appartement Kerkraderstraat in Eygelshoven | 6471 BJ (Vink) | € 750 per maand | 80 m² | Appartement | 3.0 | 2.0 |
| 1 | Appartement Koningin Wilhelminalaan in Gorinchem | 4205 ET (Haarwijk West) | € 665 per maand | 45 m² | Appartement | 2.0 | 1.0 |
| 2 | Appartement Korte Haaksbergerstraat in Enschede | 7511 JS (City) | € 1.090 per maand | 75 m² | Appartement | 3.0 | 2.0 |
| 3 | Kamer Hamelweg in Apeldoorn | 7311 EA (Brinkhorst) | € 270 per maand | 15 m² | Kamer | 1.0 | NaN |
| 4 | Appartement Witte de Withstraat in Rotterdam | 3012 BT (Cool) | € 1.295 per maand | 60 m² | Appartement | 2.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1993 | Appartement Stadhoudersplein in Rotterdam | 3038 EA (Bergpolder) | € 1.250 per maand | 45 m² | Appartement | 2.0 | 1.0 |
| 1994 | Studio Mijnsherenlaan in Rotterdam | 3081 CM (Tarwewijk) | € 750 per maand | 22 m² | Studio | 2.0 | 1.0 |
| 1995 | Appartement Eewal in Leeuwarden | 8911 GV (Grote Kerkbuurt) | € 1.200 per maand | 90 m² | Appartement | 3.0 | 2.0 |
| 1996 | Appartement Gerbrandijlaan in Middelburg | 4333 BM (Klarenbeek I) | € 645 per maand | 52 m² | Appartement | 2.0 | 1.0 |
| 1997 | Appartement Ondiep in Utrecht | 3552 EE (Ondiep) | € 1.395 per maand | 73 m² | Appartement | 4.0 | 3.0 |
1998 rows × 7 columns
# check the percentage of null values; about 21% of the bedrooms column is missing
nullvalues=(huurwon.isna().sum()/huurwon.shape[0])*100
nullvalues
title        0.000000
location     0.000000
rent         0.000000
area         0.150150
type         0.000000
rooms        1.451451
bedrooms    20.820821
dtype: float64
# here we check the column dtypes
huurwon.info()
# rent is stored as object instead of an int; rooms and bedrooms should be
# integers rather than floats, which also takes less space
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1998 entries, 0 to 1997
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   title     1998 non-null   object
 1   location  1998 non-null   object
 2   rent      1998 non-null   object
 3   area      1995 non-null   object
 4   type      1998 non-null   object
 5   rooms     1969 non-null   float64
 6   bedrooms  1582 non-null   float64
dtypes: float64(2), object(5)
memory usage: 109.4+ KB
Let's first clean the data before changing the dtypes. We start with rent: each value begins with a euro sign and ends with nine useless characters, which need to be removed. That is done below.
# remove the first character (the euro sign) and then the last nine characters
huurwon.rent=huurwon.rent.str[1:]
huurwon.rent=huurwon.rent.str[:-9]
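Positional slicing works because every value follows the '€ 750 per maand' pattern, but it is fragile. A regex-based sketch (hypothetical) that pulls out the amount directly and drops the thousands separator in one step:
# Sketch: extract the digits (with optional thousands separator) from
# strings like '€ 1.090 per maand', then remove the separator.
huurwon['rent'] = (huurwon['rent']
                   .str.extract(r'([\d.]+)', expand=False)
                   .str.replace('.', '', regex=False))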
The title column has to be reduced to the city name, since it contains both a description of the house and the city. As we only need the city name, we strip all other characters.
# keep only the last nine characters; we will filter for Amsterdam later, and 'Amsterdam' is nine letters
huurwon.title=huurwon.title.str[-9:]
# now we check the unique values
huurwon.title.unique()
array(['gelshoven', 'Gorinchem', ' Enschede', 'Apeldoorn', 'Rotterdam',
'Dordrecht', ' Den Haag', 'isterwijk', 'in Arnhem', 'n Haarlem',
'n op Zoom', ' in Stein', 'eterwoude', ' in Hoorn', 'Amsterdam',
'Den Bosch', ' in Assen', 'g in Velp', 'n Zutphen', 'in Almere',
'uiderberg', 'Groningen', 'Eindhoven', 'Terneuzen', 'n Zaandam',
' Voorhout', 'aksbergen', 'in Leiden', 'lytushoef', 'ieuwegein',
'in Boxtel', 'Wapenveld', 'n Tilburg', 'in Geleen', 'n Utrecht',
'in Beilen', 'Rozenburg', ' in Delft', 'Beverwijk', 'n Leerdam',
'hoonhoven', 'n Ter Aar', 'nbeemster', 'in Marken', 'aastricht',
'oosendaal', 'ud Gastel', 'Hilversum', ' in Ingen', 'in Zwolle',
'sselstein', 'mstelveen', 'n Heerlen', ' Voorburg', ' Blaricum',
' Rijswijk', 'r en Keer', 'n Woerden', 'oort-Zuid', 'Hoogeveen',
'Barneveld', ' Schiedam', 'n Sittard', ' Waalwijk', ' in Leuth',
' Haren Gn', ' in Breda', 'in Eersel', 'Werkendam', 'uijbergen',
'eenbergen', ' Hillegom', ' Deventer', 'olenschot', 'Hoofddorp',
' in Gouda', 'Noordwijk', 'Bilthoven', 'aardingen', ' Lelystad',
'ijkenisse', 'en IJssel', 'in Houten', 'mersfoort', 'at in Oss',
'l in Velp', ' Nijmegen', 'Doorwerth', 'in Diemen', ' Roermond',
'f in Tiel', ' Doesburg', 'eiderdorp', 'in Veghel', 'adskanaal',
' in Walem', 'in Almelo', 'Babberich', 'osterhout', 'n Brielle',
'f in Goes', 'ijnsplaat', 'in Haelen', 'n Geldrop', ' in Emmen',
'n Tegelen', 'rgen (NH)', 'wanenburg', 'Heemskerk', 'lissingen',
'n Nijkerk', 'Purmerend', 'Enkhuizen', 'Pijnacker', 'en Helder',
'n Hengelo', 'Veldhoven', 'egstgeest', 'Haarsteeg', 'ijsenburg',
' Scheemda', 'iepenveen', 'g in Wilp', 't in Velp', 'eijningen',
'Heemstede', 'rtensdijk', 'n Dalfsen', 'iddelburg', 'oetermeer',
'n Klimmen', 'Beinsdorp', 'insenbeek', 'B in Norg', ' in Lisse',
'in Andijk', ' Rosmalen', 'eg in Oss', 'lminadorp', 'odegraven',
'Nieuwveen', 'n Alkmaar', ' in Weesp', 'Steenwijk', 'Culemborg',
'ukeleveen', 'in Reusel', 'eeuwarden', ' Delfgauw', ' in Weert',
'udenbosch', 'an in Ede', 'n Abcoude', 'dschendam', 'euwleusen',
'n De Zilk', 'n op Geul', 'ddinxveen', ' Oirschot', 'in Waalre',
'uitenpost', 'n Breugel', 'ijkerhout', 'oenderloo', ' in Baarn',
' IJmuiden', 'Coevorden', 'n Vleuten', 'liedrecht', ' in Zeist',
'arderwijk', 'alkenburg', 'n in Beek', 'ederweert', 'n Katwijk',
'in Heesch', 'inschoten', 'Venhuizen', 'n Kapelle', 'lasserdam',
'Emmeloord', 'tten-Leur', 'at in Ede', 'ijndrecht', ' in Vught',
'Schinveld', 'in Goirle', 't in Echt', 'bbenvorst', 'Panningen',
' in Horst', 'hipluiden', 'in Duizel', 'Nijverdal', 'n Elspeet',
'ageningen', 'in Bussum', ' in Doorn', 'in Venray', 'k en Donk',
'vendrecht', 'gersmilde', 'Wassenaar', ' Terblijt', ' in Thorn',
' Hoogland', ' Ugchelen', 'in Vianen', ' in Sneek', 'n Winssen',
'Landgraaf', 'eg in Ede', 'in Dieren', ' in Laren', 'ijsbergen',
' Oostburg', ' in Vaals', ' Susteren', 'in Geffen', 'oensbroek',
' in Venlo', 'o-Ambacht', ' De Meern', 'hoevedorp', 'Vinkeveen',
'den Hoorn', ' in Bedum', ' Meerssen', ' Brunssum', ' Jansteen',
'n Schagen', 'in Huizen', 'in Delden', 'in Clinge', 'in Wormer',
' Katwoude', 'kenswaard', 'nickendam', ' Kerkrade', 'in Vinkel',
'De Kwakel', 'ogkarspel', 'Groenekan', 'Ulvenhout', 'ennebroek',
' in Soest', 'schenhoek', 'Krommenie', 'hugowaard', 'Werkhoven',
'Dirksland', 'uw-Vennep', 'eerenveen', 'Gelselaar', 'n Helmond',
'iddenmeer', 'in Gemert', 'Zandvoort', ' in Neede', 'Oldebroek',
'Hulshorst', ' Maarssen', 'Groesbeek', 't in Goes', 'Biervliet',
' Zevenaar', 'p in Wouw', ' Aalsmeer', 'n Warmond', ' in Bunde',
'in Leende', ' Kootwijk', 'in Nuenen', 'mmerzoden', 'h en Duin',
'Haamstede', ' Avenhorn', 't in Nuth', 'n Monster', 'n in Velp',
'urg Noord', 'an de Lek', 'in Meppel', ' den Rijn', ' Delfzijl',
' de Vecht', 'in in Ede', 'oge Hexel', 'n de Zaan', 'n de Wijk',
'g in Olst', 'sterwolde', 'inghuizen', ' in Cuijk', 'Zierikzee',
' in Budel', 'oordlaren', 'in Ezinge'], dtype=object)
As can be seen, we succeeded in stripping everything except the last nine characters, which matches the length of 'Amsterdam'. Now we filter for the city of Amsterdam and check whether we missed any other values.
# keep rows where the title contains the string 'Amsterdam'
huurwon=huurwon[huurwon['title'].str.contains("Amsterdam")]
print(huurwon.title.unique())
huurwon
['Amsterdam']
| title | location | rent | area | type | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|
| 19 | Amsterdam | 1077 MC (Apollobuurt) | 2.700 | 110 m² | Appartement | 3.0 | 2.0 |
| 35 | Amsterdam | 1017 TJ (De Weteringschans) | 1.550 | 50 m² | Appartement | 2.0 | 1.0 |
| 40 | Amsterdam | 1054 AL (Helmersbuurt) | 2.500 | 71 m² | Appartement | 3.0 | 2.0 |
| 56 | Amsterdam | 1058 GZ (Westindische Buurt) | 1.800 | 80 m² | Appartement | 3.0 | 2.0 |
| 61 | Amsterdam | 1097 VD (Betondorp) | 1.800 | 78 m² | Appartement | 4.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1939 | Amsterdam | 1094 CW (Indische Buurt West) | 1.700 | 60 m² | Appartement | 2.0 | 1.0 |
| 1952 | Amsterdam | 1015 PB (Jordaan) | 2.500 | 95 m² | Appartement | 4.0 | 2.0 |
| 1954 | Amsterdam | 1057 RM (Hoofdweg e.o.) | 1.635 | 50 m² | Appartement | 3.0 | 2.0 |
| 1959 | Amsterdam | 1066 XT (Slotervaart Zuid) | 2.750 | 163 m² | Appartement | 5.0 | NaN |
| 1974 | Amsterdam | 1054 CD (Helmersbuurt) | 1.600 | 60 m² | Appartement | 2.0 | NaN |
253 rows × 7 columns
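As an aside, the nine-character trick only works because 'Amsterdam' is nine letters, which is why the array of unique titles shown earlier is full of truncated city names. A length-independent sketch (hypothetical, applied to the original titles before any slicing) splits on the last ' in ', since titles follow the pattern '<type> <street> in <city>':
# Sketch: recover the full city name from titles like
# 'Appartement Kerkraderstraat in Eygelshoven'.
huurwon['title'] = huurwon['title'].str.split(' in ').str[-1]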
Now we move to area, where we strip the ' m²' suffix, and to location, where we keep only the zip code.
# strip the last three characters (' m²') from area
huurwon.area=huurwon.area.str[:-3]
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\3490115561.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  huurwon.area=huurwon.area.str[:-3]
# keep the first seven characters of location (the zip code)
huurwon.location=huurwon.location.str[:7]
huurwon
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\4001587665.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  huurwon.location=huurwon.location.str[:7]
| title | location | rent | area | type | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|
| 19 | Amsterdam | 1077 MC | 2.700 | 110 | Appartement | 3.0 | 2.0 |
| 35 | Amsterdam | 1017 TJ | 1.550 | 50 | Appartement | 2.0 | 1.0 |
| 40 | Amsterdam | 1054 AL | 2.500 | 71 | Appartement | 3.0 | 2.0 |
| 56 | Amsterdam | 1058 GZ | 1.800 | 80 | Appartement | 3.0 | 2.0 |
| 61 | Amsterdam | 1097 VD | 1.800 | 78 | Appartement | 4.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1939 | Amsterdam | 1094 CW | 1.700 | 60 | Appartement | 2.0 | 1.0 |
| 1952 | Amsterdam | 1015 PB | 2.500 | 95 | Appartement | 4.0 | 2.0 |
| 1954 | Amsterdam | 1057 RM | 1.635 | 50 | Appartement | 3.0 | 2.0 |
| 1959 | Amsterdam | 1066 XT | 2.750 | 163 | Appartement | 5.0 | NaN |
| 1974 | Amsterdam | 1054 CD | 1.600 | 60 | Appartement | 2.0 | NaN |
253 rows × 7 columns
huurwon.type.unique()
array(['Appartement', 'Huis', 'Studio', 'Kamer'], dtype=object)
huurwon.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 253 entries, 19 to 1974
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   title     253 non-null    object
 1   location  253 non-null    object
 2   rent      253 non-null    object
 3   area      253 non-null    object
 4   type      253 non-null    object
 5   rooms     253 non-null    float64
 6   bedrooms  204 non-null    float64
dtypes: float64(2), object(5)
memory usage: 15.8+ KB
# rename the columns to match the Kamernet dataset
huurwon=huurwon.rename(columns={'title':'city','location':'postalCode','area':'areaSqm','type':'propertyType'})
huurwon
| city | postalCode | rent | areaSqm | propertyType | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|
| 19 | Amsterdam | 1077 MC | 2.700 | 110 | Appartement | 3.0 | 2.0 |
| 35 | Amsterdam | 1017 TJ | 1.550 | 50 | Appartement | 2.0 | 1.0 |
| 40 | Amsterdam | 1054 AL | 2.500 | 71 | Appartement | 3.0 | 2.0 |
| 56 | Amsterdam | 1058 GZ | 1.800 | 80 | Appartement | 3.0 | 2.0 |
| 61 | Amsterdam | 1097 VD | 1.800 | 78 | Appartement | 4.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1939 | Amsterdam | 1094 CW | 1.700 | 60 | Appartement | 2.0 | 1.0 |
| 1952 | Amsterdam | 1015 PB | 2.500 | 95 | Appartement | 4.0 | 2.0 |
| 1954 | Amsterdam | 1057 RM | 1.635 | 50 | Appartement | 3.0 | 2.0 |
| 1959 | Amsterdam | 1066 XT | 2.750 | 163 | Appartement | 5.0 | NaN |
| 1974 | Amsterdam | 1054 CD | 1.600 | 60 | Appartement | 2.0 | NaN |
253 rows × 7 columns
At this point, when I tried to change the column dtypes I got error messages about the format of the rent column. After some investigation into how the data was crawled, it turned out not to be clean: the values contain stray "€ " characters that cannot be seen in the tables above because of how pandas renders them. Inspecting the web page the data was crawled from showed that these characters appear before the price and the area.
# strip the invisible characters discovered by inspecting the page the data was crawled from
huurwon.rent=huurwon.rent.str.strip(" € ")
huurwon.propertyType=huurwon.propertyType.str.strip(" € ")
huurwon.rent=huurwon.rent.str.replace(".","")
huurwon
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\1079939340.py:4: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
huurwon.rent=huurwon.rent.str.replace(".","")
| city | postalCode | rent | areaSqm | propertyType | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|
| 19 | Amsterdam | 1077 MC | 2700 | 110 | Appartement | 3.0 | 2.0 |
| 35 | Amsterdam | 1017 TJ | 1550 | 50 | Appartement | 2.0 | 1.0 |
| 40 | Amsterdam | 1054 AL | 2500 | 71 | Appartement | 3.0 | 2.0 |
| 56 | Amsterdam | 1058 GZ | 1800 | 80 | Appartement | 3.0 | 2.0 |
| 61 | Amsterdam | 1097 VD | 1800 | 78 | Appartement | 4.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1939 | Amsterdam | 1094 CW | 1700 | 60 | Appartement | 2.0 | 1.0 |
| 1952 | Amsterdam | 1015 PB | 2500 | 95 | Appartement | 4.0 | 2.0 |
| 1954 | Amsterdam | 1057 RM | 1635 | 50 | Appartement | 3.0 | 2.0 |
| 1959 | Amsterdam | 1066 XT | 2750 | 163 | Appartement | 5.0 | NaN |
| 1974 | Amsterdam | 1054 CD | 1600 | 60 | Appartement | 2.0 | NaN |
253 rows × 7 columns
huurwon.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 253 entries, 19 to 1974
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   city          253 non-null    object
 1   postalCode    253 non-null    object
 2   rent          253 non-null    object
 3   areaSqm       253 non-null    object
 4   propertyType  253 non-null    object
 5   rooms         253 non-null    float64
 6   bedrooms      204 non-null    float64
dtypes: float64(2), object(5)
memory usage: 15.8+ KB
#check for NAs
huurwon.isnull().sum()
city             0
postalCode       0
rent             0
areaSqm          0
propertyType     0
rooms            0
bedrooms        49
dtype: int64
# display the NA rows to study why the values could be missing
huurwon[huurwon.isnull().any(axis=1)]
| city | postalCode | rent | areaSqm | propertyType | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|
| 114 | Amsterdam | 1091 SW | 1795 | 65 | Appartement | 2.0 | NaN |
| 128 | Amsterdam | 1017 HA | 1500 | 44 | Appartement | 3.0 | NaN |
| 129 | Amsterdam | 1098 WL | 1850 | 91 | Appartement | 4.0 | NaN |
| 357 | Amsterdam | 1012 SE | 1650 | 45 | Studio | 1.0 | NaN |
| 358 | Amsterdam | 1024 AT | 1795 | 85 | Appartement | 3.0 | NaN |
| 370 | Amsterdam | 1077 JE | 2250 | 110 | Appartement | 3.0 | NaN |
| 464 | Amsterdam | 1079 AC | 2050 | 108 | Appartement | 3.0 | NaN |
| 487 | Amsterdam | 1017 WB | 1900 | 94 | Appartement | 3.0 | NaN |
| 488 | Amsterdam | 1068 PL | 1525 | 90 | Appartement | 3.0 | NaN |
| 489 | Amsterdam | 1103 MK | 1595 | 115 | Hui | 5.0 | NaN |
| 507 | Amsterdam | 1082 MK | 1650 | 46 | Appartement | 1.0 | NaN |
| 597 | Amsterdam | 1083 TG | 1750 | 73 | Appartement | 3.0 | NaN |
| 599 | Amsterdam | 1077 DE | 2100 | 80 | Appartement | 3.0 | NaN |
| 607 | Amsterdam | 1094 LC | 1500 | 70 | Appartement | 2.0 | NaN |
| 608 | Amsterdam | 1093 XR | 1750 | 55 | Appartement | 2.0 | NaN |
| 610 | Amsterdam | 1054 GX | 2200 | 65 | Appartement | 2.0 | NaN |
| 634 | Amsterdam | 1011 BJ | 2250 | 105 | Appartement | 2.0 | NaN |
| 639 | Amsterdam | 1091 CL | 1750 | 65 | Appartement | 2.0 | NaN |
| 642 | Amsterdam | 1087 KE | 2750 | 170 | Hui | 5.0 | NaN |
| 717 | Amsterdam | 1103 DV | 1595 | 111 | Appartement | 4.0 | NaN |
| 718 | Amsterdam | 1068 PJ | 1495 | 96 | Appartement | 3.0 | NaN |
| 722 | Amsterdam | 1013 GS | 1750 | 60 | Appartement | 2.0 | NaN |
| 749 | Amsterdam | 1022 KE | 1500 | 70 | Appartement | 2.0 | NaN |
| 799 | Amsterdam | 1082 VB | 1800 | 88 | Appartement | 3.0 | NaN |
| 827 | Amsterdam | 1012 VM | 2000 | 65 | Appartement | 2.0 | NaN |
| 839 | Amsterdam | 1076 BK | 2325 | 100 | Appartement | 3.0 | NaN |
| 866 | Amsterdam | 1011 EL | 2750 | 85 | Appartement | 2.0 | NaN |
| 875 | Amsterdam | 1019 PE | 2000 | 120 | Appartement | 3.0 | NaN |
| 877 | Amsterdam | 1079 GL | 2300 | 70 | Appartement | 3.0 | NaN |
| 909 | Amsterdam | 1031 HD | 1850 | 48 | Appartement | 1.0 | NaN |
| 920 | Amsterdam | 1095 MB | 1400 | 35 | Appartement | 1.0 | NaN |
| 1014 | Amsterdam | 1012 SW | 2150 | 65 | Appartement | 3.0 | NaN |
| 1044 | Amsterdam | 1073 XZ | 2100 | 75 | Appartement | 2.0 | NaN |
| 1195 | Amsterdam | 1012 VG | 1250 | 30 | Appartement | 1.0 | NaN |
| 1196 | Amsterdam | 1094 GM | 1375 | 30 | Appartement | 1.0 | NaN |
| 1203 | Amsterdam | 1012 NC | 2550 | 75 | Appartement | 3.0 | NaN |
| 1340 | Amsterdam | 1073 GH | 2500 | 88 | Appartement | 3.0 | NaN |
| 1348 | Amsterdam | 1052 BT | 2200 | 80 | Appartement | 3.0 | NaN |
| 1371 | Amsterdam | 1082 TN | 2100 | 70 | Appartement | 3.0 | NaN |
| 1372 | Amsterdam | 1051 BR | 1450 | 45 | Appartement | 2.0 | NaN |
| 1400 | Amsterdam | 1074 ET | 1700 | 45 | Appartement | 2.0 | NaN |
| 1416 | Amsterdam | 1054 XX | 1850 | 65 | Appartement | 3.0 | NaN |
| 1497 | Amsterdam | 1104 BC | 1595 | 120 | Appartement | 4.0 | NaN |
| 1620 | Amsterdam | 1057 BJ | 1950 | 60 | Appartement | 3.0 | NaN |
| 1668 | Amsterdam | 1067 WP | 2050 | 90 | Appartement | 4.0 | NaN |
| 1715 | Amsterdam | 1083 GK | 1500 | 54 | Appartement | 2.0 | NaN |
| 1793 | Amsterdam | 1054 DZ | 2450 | 85 | Appartement | 2.0 | NaN |
| 1959 | Amsterdam | 1066 XT | 2750 | 163 | Appartement | 5.0 | NaN |
| 1974 | Amsterdam | 1054 CD | 1600 | 60 | Appartement | 2.0 | NaN |
house.propertyType.value_counts()
Room                 5189
Apartment            2287
Studio                594
Anti-squat              3
Student residence       1
Name: propertyType, dtype: int64
# count the values of each type
huurwon.propertyType.value_counts()
Appartement    238
Hui             10
Studio           3
Kamer            2
Name: propertyType, dtype: int64
Below we start imputing the NAs. For a house with one room it is logical to have one bedroom too; checking the website confirms that such listings simply omit the bedroom count, because the bedroom is the room itself.
# where rooms == 1, fill the missing bedrooms with 1
huurwon.loc[huurwon.rooms ==1 , 'bedrooms'] = huurwon.loc[huurwon.rooms ==1 , 'bedrooms'].fillna(1)
# check that the fill only affected one-room listings, and look at how rooms are distributed for the 'Hui' (house) type
huurwon[(huurwon['propertyType']=='Hui' )& (huurwon['rooms']>1 )]
| city | postalCode | rent | areaSqm | propertyType | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|
| 226 | Amsterdam | 1067 PX | 2250 | 140 | Hui | 5.0 | 4.0 |
| 230 | Amsterdam | 1033 CJ | 1950 | 90 | Hui | 3.0 | 2.0 |
| 489 | Amsterdam | 1103 MK | 1595 | 115 | Hui | 5.0 | NaN |
| 642 | Amsterdam | 1087 KE | 2750 | 170 | Hui | 5.0 | NaN |
| 715 | Amsterdam | 1069 LW | 2950 | 205 | Hui | 4.0 | 3.0 |
| 969 | Amsterdam | 1097 XH | 2049 | 40 | Hui | 2.0 | 1.0 |
| 992 | Amsterdam | 1103 AR | 2000 | 140 | Hui | 5.0 | 4.0 |
| 1097 | Amsterdam | 1068 GL | 2250 | 100 | Hui | 3.0 | 2.0 |
| 1322 | Amsterdam | 1068 GL | 2250 | 100 | Hui | 3.0 | 2.0 |
| 1792 | Amsterdam | 1017 WH | 4950 | 125 | Hui | 4.0 | 3.0 |
There seems to be a pattern here, as can be seen above: with n rooms we get n − 1 bedrooms.
For example, a house with 5 rooms has 4 bedrooms. Following this logic, it seems reasonable to impute every missing value with n − 1.
# for the 'Hui' type, fill the missing bedrooms with 4 (both missing rows have 5 rooms)
huurwon.loc[huurwon.propertyType =='Hui' , 'bedrooms'] = huurwon.loc[huurwon.propertyType =='Hui' , 'bedrooms'].fillna(4)
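Both missing 'Hui' rows happen to have five rooms, so filling with 4 is consistent with the n − 1 pattern. A sketch (hypothetical) that encodes the rule directly and would also generalise to other property types:
# Sketch: impute missing bedrooms as rooms - 1.
mask = huurwon['bedrooms'].isna()
huurwon.loc[mask, 'bedrooms'] = huurwon.loc[mask, 'rooms'] - 1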
Now we look at the apartment type and study its missing data.
# check how rooms are distributed for apartments with more than three rooms
huurwon[(huurwon['propertyType']=='Appartement' )& (huurwon['rooms']>3 )]
| city | postalCode | rent | areaSqm | propertyType | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|
| 61 | Amsterdam | 1097 VD | 1800 | 78 | Appartement | 4.0 | 2.0 |
| 129 | Amsterdam | 1098 WL | 1850 | 91 | Appartement | 4.0 | NaN |
| 314 | Amsterdam | 1078 JR | 2450 | 125 | Appartement | 4.0 | 3.0 |
| 463 | Amsterdam | 1054 KX | 2450 | 104 | Appartement | 4.0 | 3.0 |
| 538 | Amsterdam | 1017 ZL | 3450 | 110 | Appartement | 4.0 | 3.0 |
| 619 | Amsterdam | 1063 GZ | 1600 | 70 | Appartement | 4.0 | 3.0 |
| 659 | Amsterdam | 1078 KP | 3400 | 118 | Appartement | 4.0 | 3.0 |
| 717 | Amsterdam | 1103 DV | 1595 | 111 | Appartement | 4.0 | NaN |
| 755 | Amsterdam | 1077 LS | 2850 | 130 | Appartement | 5.0 | 2.0 |
| 814 | Amsterdam | 1053 KT | 3250 | 100 | Appartement | 4.0 | 3.0 |
| 841 | Amsterdam | 1025 ZW | 1850 | 87 | Appartement | 4.0 | 3.0 |
| 915 | Amsterdam | 1053 TA | 1600 | 48 | Appartement | 4.0 | 3.0 |
| 933 | Amsterdam | 1094 CW | 2750 | 87 | Appartement | 4.0 | 3.0 |
| 961 | Amsterdam | 1017 BB | 3100 | 123 | Appartement | 4.0 | 2.0 |
| 963 | Amsterdam | 1078 GH | 1950 | 83 | Appartement | 4.0 | 2.0 |
| 1060 | Amsterdam | 1083 JM | 1950 | 95 | Appartement | 4.0 | 2.0 |
| 1065 | Amsterdam | 1098 WL | 1950 | 95 | Appartement | 4.0 | 3.0 |
| 1131 | Amsterdam | 1102 EB | 2000 | 93 | Appartement | 4.0 | 3.0 |
| 1136 | Amsterdam | 1055 AD | 2350 | 96 | Appartement | 4.0 | 3.0 |
| 1170 | Amsterdam | 1098 WR | 1750 | 70 | Appartement | 4.0 | 3.0 |
| 1229 | Amsterdam | 1071 NL | 4500 | 122 | Appartement | 4.0 | 3.0 |
| 1262 | Amsterdam | 1054 TD | 2000 | 75 | Appartement | 4.0 | 2.0 |
| 1380 | Amsterdam | 1071 WS | 2050 | 83 | Appartement | 4.0 | 2.0 |
| 1477 | Amsterdam | 1098 PX | 2200 | 92 | Appartement | 4.0 | 2.0 |
| 1497 | Amsterdam | 1104 BC | 1595 | 120 | Appartement | 4.0 | NaN |
| 1512 | Amsterdam | 1052 BV | 3950 | 145 | Appartement | 7.0 | 3.0 |
| 1566 | Amsterdam | 1058 BC | 2500 | 165 | Appartement | 5.0 | 3.0 |
| 1608 | Amsterdam | 1012 DC | 2200 | 93 | Appartement | 4.0 | 2.0 |
| 1629 | Amsterdam | 1095 JJ | 2050 | 133 | Appartement | 4.0 | 3.0 |
| 1668 | Amsterdam | 1067 WP | 2050 | 90 | Appartement | 4.0 | NaN |
| 1747 | Amsterdam | 1054 GD | 3000 | 139 | Appartement | 4.0 | 2.0 |
| 1750 | Amsterdam | 1077 HR | 3000 | 121 | Appartement | 5.0 | 2.0 |
| 1780 | Amsterdam | 1055 NA | 1750 | 140 | Appartement | 4.0 | 2.0 |
| 1832 | Amsterdam | 1053 BE | 3500 | 122 | Appartement | 4.0 | 3.0 |
| 1845 | Amsterdam | 1054 AW | 2950 | 120 | Appartement | 4.0 | 2.0 |
| 1952 | Amsterdam | 1015 PB | 2500 | 95 | Appartement | 4.0 | 2.0 |
| 1959 | Amsterdam | 1066 XT | 2750 | 163 | Appartement | 5.0 | NaN |
Looking at the apartment type, the data is not missing because of the crawling; it is actually missing from the website itself. As proof, compare the row with index 129 in the table above against the listing at https://www.huurwoningen.nl/huren/amsterdam/1282353/esplanade-de-meer/. This is considered data missing at random.
# drop these rows for now
huurwon=huurwon.dropna()
huurwon.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 212 entries, 19 to 1954
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   city          212 non-null    object
 1   postalCode    212 non-null    object
 2   rent          212 non-null    object
 3   areaSqm       212 non-null    object
 4   propertyType  212 non-null    object
 5   rooms         212 non-null    float64
 6   bedrooms      212 non-null    float64
dtypes: float64(2), object(5)
memory usage: 13.2+ KB
#now we change the types of the columns to the correct type
#here we change the rent from object to int
huurwon.rent=huurwon.rent.astype(int)
#here we change the area from object to int
huurwon.area=huurwon.areaSqm.astype(int)
#here we change number of rooms to int
huurwon.rooms=huurwon.rooms.astype(int)
#and here we change the number of bedrooms to int
huurwon.bedrooms=huurwon.bedrooms.astype(int)
huurwon.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 212 entries, 19 to 1954
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   city          212 non-null    object
 1   postalCode    212 non-null    object
 2   rent          212 non-null    int32
 3   areaSqm       212 non-null    object
 4   propertyType  212 non-null    object
 5   rooms         212 non-null    int32
 6   bedrooms      212 non-null    int32
dtypes: int32(3), object(4)
memory usage: 10.8+ KB
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\1030876999.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  huurwon.rent=huurwon.rent.astype(int)
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\1030876999.py:5: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  huurwon.area=huurwon.areaSqm.astype(int)
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\1030876999.py:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  huurwon.rooms=huurwon.rooms.astype(int)
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\1030876999.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  huurwon.bedrooms=huurwon.bedrooms.astype(int)
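Note the UserWarning in the output above: huurwon.area = ... creates a new attribute on the DataFrame rather than a column, so areaSqm itself is still an object in the info() listing. A sketch of the presumably intended conversion, taking an explicit copy first so the SettingWithCopyWarning disappears as well:
# Sketch: convert the columns on an explicit copy, writing back with
# bracket assignment so the DataFrame itself is updated.
huurwon = huurwon.copy()
for col in ['rent', 'areaSqm', 'rooms', 'bedrooms']:
    huurwon[col] = huurwon[col].astype(int)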
# translate the values of the propertyType column to match the other dataset
huurwon['propertyType']=huurwon['propertyType'].replace(['Hui','Kamer','Appartement','Studio'],['House','Room','Apartment','Studio'])
C:\Users\ramya\AppData\Local\Temp\ipykernel_34936\3662581421.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  huurwon['propertyType']=huurwon['propertyType'].replace(['Hui','Kamer','Appartement','Studio'],['House','Room','Apartment','Studio'])
# check that the values actually changed
huurwon
| city | postalCode | rent | areaSqm | propertyType | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|
| 19 | Amsterdam | 1077 MC | 2700 | 110 | Apartment | 3 | 2 |
| 35 | Amsterdam | 1017 TJ | 1550 | 50 | Apartment | 2 | 1 |
| 40 | Amsterdam | 1054 AL | 2500 | 71 | Apartment | 3 | 2 |
| 56 | Amsterdam | 1058 GZ | 1800 | 80 | Apartment | 3 | 2 |
| 61 | Amsterdam | 1097 VD | 1800 | 78 | Apartment | 4 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1923 | Amsterdam | 1091 GR | 3250 | 40 | Apartment | 1 | 1 |
| 1926 | Amsterdam | 1071 VW | 1850 | 65 | Apartment | 2 | 1 |
| 1939 | Amsterdam | 1094 CW | 1700 | 60 | Apartment | 2 | 1 |
| 1952 | Amsterdam | 1015 PB | 2500 | 95 | Apartment | 4 | 2 |
| 1954 | Amsterdam | 1057 RM | 1635 | 50 | Apartment | 3 | 2 |
212 rows × 7 columns
# number of unique postal codes in the Kamernet dataset
house.postalCode.nunique()
3243
# number of unique postal codes in the Huurwoningen dataset
huurwon.postalCode.nunique()
190
Here we try to enhance the Kamernet dataset with the datasets from Huurwoningen and Funda, enriching it with more features in the hope of better accuracy.
# inner-merge both datasets on the property type and postal code they share
new =pd.merge(house,huurwon,how='inner',left_on=['propertyType','postalCode'],right_on=['propertyType','postalCode'], suffixes=('','_del'))
# drop the duplicated columns created by the merge
new.drop([i for i in new.columns if '_del' in i],axis=1,inplace=True)
#we check the shape of the data
new.shape
(99, 14)
It is noticeable that, by enhancing the dataset with the number of rooms and bedrooms, we reduced it from roughly 8,000 rows to 99. Let's try a third dataset and check whether we can increase the number of rows.
#loading funda dataset
funda=pd.read_json(r'C:\Users\ramya\Documents\remy\challenge\amsterdam_sold2.json', orient='records')
funda
| year_built | area | url | price | bedrooms | sale_date | postal_code | rooms | address | posting_date | property_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1906 | 108 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 510000 | 2 | 23-6-2016 | 1013 TN | 3 | Knollendamstraat 4 III/IV | 4-6-2016 | apartment |
| 1 | 1938 | 47 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 215000 | 1 | 5-7-2016 | 1079 XM | 2 | Moerdijkstraat 47 1 | 22-6-2016 | apartment |
| 2 | 2003 | 116 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 325000 | 2 | 8-7-2016 | 1095 AD | 3 | Zeeburgerdijk 349 | 9-6-2016 | apartment |
| 3 | 1910 | 58 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 315000 | 2 | 2-6-2016 | 1054 VH | 3 | Brederodestraat 124 -1 | 14-5-2016 | apartment |
| 4 | 1906 | 63 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 200000 | 1 | 23-6-2016 | 1055 MD | 3 | Admiraal De Ruijterweg 409 III | 14-6-2016 | apartment |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11438 | 1938 | 113 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 599000 | 3 | 3-6-2015 | 1077 CS | 5 | Olympiaplein 83 -III | 22-5-2015 | apartment |
| 11439 | 1993 | 88 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 399000 | 2 | 29-5-2015 | 1015 NH | 3 | Anjeliersstraat 20 | 12-5-2015 | apartment |
| 11440 | 1906 | 77 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 399000 | 2 | 19-6-2015 | 1072 GV | 4 | Rustenburgerstraat 389 I | 6-6-2015 | apartment |
| 11441 | 1931 | 90 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 259000 | 3 | 19-5-2015 | 1055 NX | 4 | Doggersbankstraat 12 II | 2-5-2015 | apartment |
| 11442 | 1937 | 54 | http://www.funda.nl/koop/verkocht/amsterdam/ap... | 165000 | 2 | 23-6-2015 | 1055 VN | 3 | Anna van Burenstraat 24 2 | 20-5-2015 | apartment |
11443 rows × 11 columns
# merge the new dataset with the Funda dataset on postal code and property type
new2 =pd.merge(new,funda,how='inner',left_on=['postalCode', 'propertyType'],right_on=['postal_code','property_type'])
new2.shape
# this merged dataset is not used any further
(0, 25)
I tried to enhance the original dataset by merging it with the Huurwoningen data to add new features. After that, I tried to merge the result ('new') with the Funda dataset ('funda'). These two do not share the same houses: the merge yields zero rows, as Funda is a company that sells houses rather than renting them out. So we will not use the third dataset ('new2'), and we continue with the first merged dataset ('new').
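When a merge unexpectedly comes back empty, pandas' indicator flag is a quick diagnostic; a sketch (hypothetical) that shows how many rows come from each side and how many keys overlap. Note also that the property-type labels differ in case between the two frames ('Apartment' vs. 'apartment'), which by itself would prevent any match on that key:
# Sketch: an outer merge with indicator=True reveals the overlap.
check = pd.merge(new, funda, how='outer',
                 left_on=['postalCode', 'propertyType'],
                 right_on=['postal_code', 'property_type'],
                 indicator=True)
print(check['_merge'].value_counts())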
Here we do EDA on the merged data and on the original data; in the modelling phase we will compare the metrics of the enhanced dataset with those of the original.
We first look at the areaSqm column and see what information we can get from it.
new.describe()
| areaSqm | longitude | latitude | rent | rooms | bedrooms | |
|---|---|---|---|---|---|---|
| count | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 | 99.000000 |
| mean | 58.525253 | 4.888757 | 52.361733 | 1459.545455 | 2.222222 | 1.575758 |
| std | 30.912102 | 0.053662 | 0.018053 | 519.373195 | 0.920909 | 0.701178 |
| min | 9.000000 | 4.789949 | 52.321983 | 540.000000 | 1.000000 | 1.000000 |
| 25% | 41.000000 | 4.860916 | 52.355492 | 1040.000000 | 1.000000 | 1.000000 |
| 50% | 60.000000 | 4.891351 | 52.367256 | 1500.000000 | 2.000000 | 1.000000 |
| 75% | 80.500000 | 4.915396 | 52.374161 | 1797.500000 | 3.000000 | 2.000000 |
| max | 120.000000 | 4.972860 | 52.404444 | 2750.000000 | 4.000000 | 3.000000 |
Using the describe function, we notice that the minimum value for areaSqm is 9, the max is 120 and the average is about 58.5. The rent has a mean value of 1459, with a minimum of 540 and a maximum of 2750. Let's dive in further to get more information.
#plotting a histogram to see the distribution of areaSqm
binwidth= 9
plt.hist(new.areaSqm,bins=range(min(new.areaSqm),max(new.areaSqm)+binwidth,binwidth))
plt.show()
#using seaborn to plot the area; stat="density" normalizes the histogram to unit area, and kde=True overlays the estimated density curve
import seaborn as sns
sns.histplot(new.areaSqm, color="red", kde=True, stat="density", linewidth=4)
<AxesSubplot:xlabel='areaSqm', ylabel='Density'>
By plotting the histogram we can see that it is right-skewed, which means a model fitted to it could make larger errors and overestimate the outcome variable. In addition, the bulk of the observations sits at the lower end of the area range, which could tell us that most people cannot afford big houses.
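To back the visual impression with a number, we can compute the sample skewness directly; a positive value confirms the right skew. A minimal check using pandas' built-in skew:
# positive skewness confirms the right (positive) skew seen in the histogram
print('areaSqm skewness:', new['areaSqm'].skew())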
#here we plot a pair plot to see if there is any linearity between the columns
sns.pairplot(new,hue='propertyType')
<seaborn.axisgrid.PairGrid at 0x1df4720fd60>
What can be seen is that there is a positive correlation between areaSqm and rent. Looking at rent against areaSqm with propertyType as the hue, apartments are at the higher end, studios are in the mid range and rooms are at the low end. This is interesting, as it supports my hypothesis that area and property type have an impact on rent.
Here we plot an interactive map of the houses in Amsterdam based on longitude and latitude.
#here we plot an interactive map, as it is easy to use and easy to hover over
import plotly.express as px
fig = px.scatter_mapbox(new, lat="latitude", lon="longitude", color="rent")
fig.update_layout(mapbox_style="open-street-map")
fig.show()
new
| areaSqm | city | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | crawlStatus | postalCode | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120 | Amsterdam | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | done | 1095 ME | 3 | 2 |
| 1 | 100 | Amsterdam | 4.963605 | 52.374108 | Shared | Shared | Shared | Shared | Apartment | 1785 | done | 1095 ME | 3 | 2 |
| 2 | 120 | Amsterdam | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | done | 1095 ME | 3 | 2 |
| 3 | 120 | Amsterdam | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1795 | done | 1095 ME | 3 | 2 |
| 4 | 44 | Amsterdam | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1200 | done | 1095 ME | 3 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 80 | Amsterdam | 4.896287 | 52.356954 | Own | Own | Own | Own | Apartment | 1900 | done | 1073 XZ | 3 | 2 |
| 95 | 98 | Amsterdam | 4.943846 | 52.394173 | Own | Own | Own | Own | Apartment | 1650 | done | 1024 PG | 3 | 2 |
| 96 | 48 | Amsterdam | 4.875135 | 52.368200 | Own | Own | Own | Own | Apartment | 1550 | done | 1053 DV | 1 | 1 |
| 97 | 20 | Amsterdam | 4.908423 | 52.369501 | Own | Own | Own | Shared | Apartment | 650 | done | 1011 MK | 3 | 2 |
| 98 | 60 | Amsterdam | 4.800028 | 52.376855 | Shared | Shared | Shared | Shared | Apartment | 1750 | done | 1067 LN | 3 | 2 |
99 rows × 14 columns
#here we box plot the values to see the distribution
plt.figure(figsize=(30,20))
#we use sublots to stack the graphs into one pic
plt.subplot(2,3,1)
sns.boxplot(x='propertyType',y='rent',data=new)
plt.subplot(2,3,2)
sns.boxplot(x='bedrooms',y='rent',data=new)
plt.subplot(2,3,3)
sns.boxplot(x='shower',y='rent',data=new)
plt.subplot(2,3,4)
sns.boxplot(x='kitchen',y='rent',data=new)
<AxesSubplot:xlabel='kitchen', ylabel='rent'>
Using the box plot visualization, we can see some interesting EDA about the houses; box plots are useful for visualizing the distribution, median, range and outliers, which is why they are used here. Let's study them one by one. As can be seen in the box plot, 50% of the data for apartments is concentrated between 1500 and 1800, while the remaining 50% lies between the minimum and maximum of 1000-2300, outside the interquartile range. For the room type, 50% of the data falls between 700 and 900 and the rest between 500 and 1100, outside the interquartile range. What is interesting here is that the apartment type seems to follow a roughly normal distribution, where Q3-Q2 = Q2-Q1, with many outliers; the same goes for rooms. In addition, the box plots show that the property types overlap, which makes it harder for a model to differentiate between them.
In the bedrooms box plots, it can be seen that a house with 3 bedrooms is negatively skewed (Q3-Q2 < Q2-Q1) and a house with 1 bedroom is positively skewed (Q3-Q2 > Q2-Q1).
If we look at the shower box plot, the shared shower appears negatively skewed, unlike the own shower, which is positively skewed.
In the kitchen box plot, "None" shows only a single line, indicating very few observations (arguably the least wanted option). The shared kitchen has a wide spread and is negatively skewed (Q3-Q2 < Q2-Q1). The own kitchen appears close to normally distributed, with a slight positive skew.
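The Q3-Q2 versus Q2-Q1 comparisons used above can also be verified numerically from the quartiles. A small helper, sketched against the new dataframe:
# Q3-Q2 > Q2-Q1 suggests a positive (right) skew, Q3-Q2 < Q2-Q1 a negative (left) one
def iqr_skew(series):
    q1, q2, q3 = series.quantile([0.25, 0.5, 0.75])
    return (q3 - q2) - (q2 - q1)
print(new.groupby('propertyType')['rent'].apply(iqr_skew))
print(new.groupby('bedrooms')['rent'].apply(iqr_skew))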
#here we plot the correlation heat map of the numerical variables
cor=new.corr()
f,ax=plt.subplots(figsize=(10,10))
sns.heatmap(cor, xticklabels=cor.columns, yticklabels=cor.columns,
annot=True, cmap=sns.diverging_palette(220, 700, as_cmap=True))
<AxesSubplot:>
Looking at the correlation of the numerical variables in this heatmap, areaSqm has a high correlation of 0.81 with the rent price. Furthermore, contrary to expectation, there is a low correlation between rent and the number of bedrooms. areaSqm also correlates well with the number of rooms but less with the number of bedrooms, while there is a correlation of 0.6 between bedrooms and longitude.
# before we continue further, let's drop the unneeded columns (city, crawlStatus and postalCode)
new = new.drop(columns=['city', 'crawlStatus', 'postalCode'])
new
| areaSqm | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | rooms | bedrooms | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | 3 | 2 |
| 1 | 100 | 4.963605 | 52.374108 | Shared | Shared | Shared | Shared | Apartment | 1785 | 3 | 2 |
| 2 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | 3 | 2 |
| 3 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1795 | 3 | 2 |
| 4 | 44 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1200 | 3 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 80 | 4.896287 | 52.356954 | Own | Own | Own | Own | Apartment | 1900 | 3 | 2 |
| 95 | 98 | 4.943846 | 52.394173 | Own | Own | Own | Own | Apartment | 1650 | 3 | 2 |
| 96 | 48 | 4.875135 | 52.368200 | Own | Own | Own | Own | Apartment | 1550 | 1 | 1 |
| 97 | 20 | 4.908423 | 52.369501 | Own | Own | Own | Shared | Apartment | 650 | 3 | 2 |
| 98 | 60 | 4.800028 | 52.376855 | Shared | Shared | Shared | Shared | Apartment | 1750 | 3 | 2 |
99 rows × 11 columns
#here we plot the correlation heat map for all numerical and categorical data
from dython.nominal import associations
compcor=associations(new,filename = 'corr.png',figsize=(10,10))
It can be seen that rent has a high correlation with property type (0.67) and a low correlation with bedrooms (0.07). Rent also has a low correlation with toilet, shower and kitchen, but a high correlation with rooms, living and areaSqm.
Based on this correlation heat map we will keep only the features whose correlation with rent is higher than 0.30, and they are:
#create a data frame with the wanted columns
prepdata=new[['areaSqm','longitude','latitude','toilet','shower','kitchen',
'living','propertyType','rent','rooms']]
prepdata
| areaSqm | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | rooms | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | 3 |
| 1 | 100 | 4.963605 | 52.374108 | Shared | Shared | Shared | Shared | Apartment | 1785 | 3 |
| 2 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | 3 |
| 3 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1795 | 3 |
| 4 | 44 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1200 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 80 | 4.896287 | 52.356954 | Own | Own | Own | Own | Apartment | 1900 | 3 |
| 95 | 98 | 4.943846 | 52.394173 | Own | Own | Own | Own | Apartment | 1650 | 3 |
| 96 | 48 | 4.875135 | 52.368200 | Own | Own | Own | Own | Apartment | 1550 | 1 |
| 97 | 20 | 4.908423 | 52.369501 | Own | Own | Own | Shared | Apartment | 650 | 3 |
| 98 | 60 | 4.800028 | 52.376855 | Shared | Shared | Shared | Shared | Apartment | 1750 | 3 |
99 rows × 10 columns
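As a cross-check, the same feature selection could be derived programmatically from the association matrix computed above. This is a sketch, assuming the matrix is available from compcor (newer dython versions return a dict with the matrix under 'corr'; older versions return the dataframe directly):
# select the features whose association with rent exceeds 0.30
# (rent itself will trivially appear in the list)
assoc = compcor['corr'] if isinstance(compcor, dict) else compcor
strong = assoc['rent'][assoc['rent'].abs() > 0.30].index.tolist()
print(strong)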
Now we repeat the EDA on the original dataset (house), again starting with the areaSqm column.
# get to display the info of the dataframe
house.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 8074 entries, 1 to 46714 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 areaSqm 8074 non-null int64 1 city 8074 non-null object 2 longitude 8074 non-null float64 3 latitude 8074 non-null float64 4 toilet 8074 non-null object 5 shower 8074 non-null object 6 kitchen 8074 non-null object 7 living 8074 non-null object 8 propertyType 8074 non-null object 9 rent 8074 non-null int64 10 crawlStatus 8074 non-null object 11 postalCode 8074 non-null object dtypes: float64(2), int64(2), object(8) memory usage: 820.0+ KB
house.describe()
| areaSqm | longitude | latitude | rent | |
|---|---|---|---|---|
| count | 8074.000000 | 8074.000000 | 8074.000000 | 8074.000000 |
| mean | 31.988358 | 4.888667 | 52.358590 | 970.569854 |
| std | 29.334260 | 0.053321 | 0.024571 | 513.987638 |
| min | 6.000000 | 4.771238 | 52.289948 | 1.000000 |
| 25% | 12.000000 | 4.851282 | 52.347484 | 620.000000 |
| 50% | 18.000000 | 4.887185 | 52.359754 | 795.000000 |
| 75% | 48.000000 | 4.926215 | 52.374210 | 1250.000000 |
| max | 280.000000 | 5.018227 | 52.423914 | 5000.000000 |
Using the describe function, we notice that the minimum value for areaSqm is 6, the max is 280 and the average is almost 32. Let's dive in further to get more information.
#here we plot a histogram to display the distribution of the data
binwidth= 8
plt.hist(house.areaSqm,bins=range(min(house.areaSqm),max(house.areaSqm)+binwidth,binwidth))
plt.show()
By plotting the histogram we can see that it is right-skewed, which means a model could make larger errors and overestimate the outcome variable. In addition, the concentration is mostly between 0 and 50, around the median of 18, which could tell us that most people cannot afford big houses.
#here we plot a histogram of the rent with a bin width of 100 per class; with a smaller bin width the shape would be hard to see
binwidth= 100
plt.hist(house.rent,bins=range(min(house.rent),max(house.rent)+binwidth,binwidth))
plt.show()
From the plot above, the distribution of the rent is right-skewed, which means it could lead to overestimation since it is not a normal bell curve, but the median is visible and seems to be around 700 euros.
import seaborn as sns
sns.pairplot(house,hue='propertyType')
<seaborn.axisgrid.PairGrid at 0x1df49a82820>
What can be seen is that there is a positive correlation between areaSqm and rent. Looking at rent against areaSqm with propertyType as the hue, apartments are at the higher end, studios are in the mid range and rooms are at the low end.
plt.figure(figsize=(20,30))
plt.subplot(2,3,1)
sns.boxplot(x='propertyType',y='rent',data=house)
plt.subplot(2,3,2)
sns.boxplot(x='shower',y='rent',data=house)
plt.subplot(2,3,3)
sns.boxplot(x='kitchen',y='rent',data=house)
plt.subplot(2,3,4)
sns.boxplot(x='living',y='rent',data=house)
<AxesSubplot:xlabel='living', ylabel='rent'>
Using the box plot visualization, we can see some interesting EDA about the houses. Let's study them one by one. In the property type box plot, the student residence shows only a single line, which means we do not have much information about it; on the other hand, anti-squat houses appear to be the cheapest and least common, with a median under 500. For apartments, 50% of the data is concentrated between 1300 and 1800, and the remaining 50% lies between the minimum and maximum of 800-2700, outside the interquartile range. For rooms, 50% of the data falls between 700 and 900 and the rest between 400 and 1100; for studios, 50% falls between 900 and 1100 and the rest between 500 and 2000. What is interesting here is that apartments seem to follow a roughly normal distribution (Q3-Q2 = Q2-Q1) with many outliers, whereas the others are negatively skewed (Q3-Q2 < Q2-Q1). In addition, the box plots show that the property types overlap, which makes it harder for a model to differentiate between them.
What is also interesting is that all the other box plots share a similar distribution for the owned and shared categories, with a lot of outliers.
Now we plot the prices on the map, based on latitude and longitude. Looking at the map, we can see that the further you get from the city centre, the less expensive the rent is.
#here we plot an interactive map, as it is easy to use and easy to hover over
import plotly.express as px
fig = px.scatter_mapbox(house, lat="latitude", lon="longitude", color="rent")
fig.update_layout(mapbox_style="open-street-map")
fig.show()
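The visual impression that rent drops as we move away from the centre can also be quantified with a rough planar distance to a central reference point. This is a sketch; the centre coordinates (roughly Dam Square) and the flat-distance approximation are simplifying assumptions.
import numpy as np
# approximate distance in degrees from an assumed city-centre point
center_lat, center_lon = 52.373, 4.893
dist = np.sqrt((house['latitude'] - center_lat) ** 2 + (house['longitude'] - center_lon) ** 2)
print('correlation of rent with distance from centre:', house['rent'].corr(dist))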
#drop city (we already know it is Amsterdam), the crawl status and the zip code
house = house.drop(columns=['city', 'crawlStatus', 'postalCode'])
house
| areaSqm | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 30 | 4.920721 | 52.370200 | Own | Own | Own | Own | Studio | 950 |
| 2 | 11 | 4.854786 | 52.350880 | Shared | Shared | Shared | Shared | Room | 1000 |
| 11 | 60 | 4.879218 | 52.354884 | Own | Own | Own | Own | Apartment | 1590 |
| 17 | 19 | 4.976048 | 52.326211 | Shared | Shared | Shared | Shared | Room | 750 |
| 23 | 12 | 4.824007 | 52.352244 | Own | Own | Shared | Own | Room | 800 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46695 | 20 | 4.988951 | 52.318325 | Shared | Shared | Shared | Shared | Room | 1000 |
| 46698 | 10 | 4.877377 | 52.384534 | Shared | Shared | Shared | Shared | Room | 750 |
| 46701 | 15 | 4.823299 | 52.379522 | Shared | Shared | Shared | Shared | Room | 750 |
| 46710 | 10 | 4.985070 | 52.320498 | Shared | Shared | Shared | Shared | Room | 550 |
| 46714 | 8 | 4.824709 | 52.382689 | Shared | Shared | Shared | None | Room | 700 |
8074 rows × 9 columns
# here we find the correlation for the numerical columns
cor=house.corr()
f,ax=plt.subplots(figsize=(10,10))
#using the heatmap from the seaborn library makes it easier for us to visualize
sns.heatmap(cor, xticklabels=cor.columns, yticklabels=cor.columns,
annot=True, cmap=sns.diverging_palette(220, 700, as_cmap=True))
<AxesSubplot:>
Looking at the correlation of the numerical variables in this heatmap, areaSqm has a high correlation of 0.83 with the rent price.
#identify the nominal columns and plot a correlation heatmap of all columns
from dython.nominal import associations, identify_nominal_columns
compcor=associations(house,filename = 'corr.png',figsize=(10,10))
cf=identify_nominal_columns(house)
cf
['toilet', 'shower', 'kitchen', 'living', 'propertyType']
It can be observed that there is a high correlation of 0.83 between areaSqm and rent, 0.80 between property type and areaSqm, and a 0.77 correlation between property type and rent. The other facilities (toilet, shower and kitchen) all share the same correlation of 0.38.
Here we will prepare both datasets (the merged data and the original data) to be fed into the model by one-hot encoding them, so that their results can be compared.
Since most of our features are categorical, we use a general approach that does not presume any ordering. Regression algorithms do not accept objects or strings, only numerical values, so the best approach for our categorical values is to one-hot encode them into binary columns, where each instance is either 0 or 1. A more concise pandas equivalent is sketched below.
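For reference, here is that concise equivalent using pd.get_dummies; the generated column names (e.g. toilet_Own) would differ from the hand-built ones used below, which is why the explicit map calls are kept.
# one-hot encode all categorical columns in a single call
encoded = pd.get_dummies(prepdata, columns=['toilet', 'shower', 'kitchen', 'living', 'propertyType'])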
# here we map all to either 0 or 1
prepdata['owntoilet'] = prepdata['toilet'].map( {'Own': 1, 'Shared': 0, 'None': 0} ).astype(int)
prepdata['sharedtoilet'] = prepdata['toilet'].map( {'Own': 0, 'Shared': 1, 'None': 0} ).astype(int)
prepdata['notoilet'] = prepdata['toilet'].map( {'Own': 0, 'Shared': 0, 'None': 1} ).astype(int)
# here we map all to either 0 or 1
prepdata['ownshower'] = prepdata['shower'].map( {'Own': 1, 'Shared': 0, 'None': 0} ).astype(int)
prepdata['sharedshower'] = prepdata['shower'].map( {'Own': 0, 'Shared': 1, 'None': 0} ).astype(int)
prepdata['noshower'] = prepdata['shower'].map( {'Own': 0, 'Shared': 0, 'None': 1} ).astype(int)
# here we map all to either 0 or 1
prepdata['ownkitchen'] = prepdata['kitchen'].map( {'Own': 1, 'Shared': 0, 'None': 0} ).astype(int)
prepdata['sharedkitchen'] = prepdata['kitchen'].map( {'Own': 0, 'Shared': 1, 'None': 0} ).astype(int)
prepdata['nokitchen'] = prepdata['kitchen'].map( {'Own': 0, 'Shared': 0, 'None': 1} ).astype(int)
# here we map all to either 0 or 1
prepdata['ownliving'] = prepdata['living'].map( {'Own': 1, 'Shared': 0, 'None': 0} ).astype(int)
prepdata['sharedliving'] = prepdata['living'].map( {'Own': 0, 'Shared': 1, 'None': 0} ).astype(int)
prepdata['noliving'] = prepdata['living'].map( {'Own': 0, 'Shared': 0, 'None': 1} ).astype(int)
# here we map all to either 0 or 1
prepdata['kamer'] = prepdata['propertyType'].map( { 'Room': 1, 'Apartment': 0} ).astype(int)
prepdata['appartement'] = prepdata['propertyType'].map( {'Room': 0, 'Apartment': 1} ).astype(int)
#prepdata['rooms1'] = huurwon['rooms'].map( { '1.0': 1, '2.0': 0, '3.0':0, '4.0':0 } ).astype(int)
#prepdata['rooms2'] = huurwon['rooms'].map( {'1.0': 0, '2.0': 1, '3.0':0, '4.0':0 } ).astype(int)
#prepdata['rooms3'] = huurwon['rooms'].map( {'1.0': 0, '2.0': 0, '3.0':1, '4.0':0 } ).astype(int)
#prepdata['rooms4'] = huurwon['rooms'].map( {'1.0': 0, '2.0': 0, '3.0':0, '4.0':1 } ).astype(int)
prepdata
| areaSqm | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | rooms | owntoilet | sharedtoilet | notoilet | ownshower | sharedshower | noshower | ownkitchen | sharedkitchen | nokitchen | ownliving | sharedliving | noliving | kamer | appartement | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 100 | 4.963605 | 52.374108 | Shared | Shared | Shared | Shared | Apartment | 1785 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1795 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 44 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1200 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 80 | 4.896287 | 52.356954 | Own | Own | Own | Own | Apartment | 1900 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 95 | 98 | 4.943846 | 52.394173 | Own | Own | Own | Own | Apartment | 1650 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 96 | 48 | 4.875135 | 52.368200 | Own | Own | Own | Own | Apartment | 1550 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 97 | 20 | 4.908423 | 52.369501 | Own | Own | Own | Shared | Apartment | 650 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 98 | 60 | 4.800028 | 52.376855 | Shared | Shared | Shared | Shared | Apartment | 1750 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
99 rows × 24 columns
# here we use the get_dummies function; mapping rooms as we did the rest raises an error because of hidden NAs
prepdata[['room1','room2','room3','rooms4']]=pd.get_dummies(prepdata.rooms)
prepdata
| areaSqm | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | rooms | owntoilet | sharedtoilet | notoilet | ownshower | sharedshower | noshower | ownkitchen | sharedkitchen | nokitchen | ownliving | sharedliving | noliving | kamer | appartement | room1 | room2 | room3 | rooms4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 100 | 4.963605 | 52.374108 | Shared | Shared | Shared | Shared | Apartment | 1785 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1850 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 120 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1795 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 44 | 4.963605 | 52.374108 | Own | Own | Own | Own | Apartment | 1200 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 80 | 4.896287 | 52.356954 | Own | Own | Own | Own | Apartment | 1900 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 95 | 98 | 4.943846 | 52.394173 | Own | Own | Own | Own | Apartment | 1650 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 96 | 48 | 4.875135 | 52.368200 | Own | Own | Own | Own | Apartment | 1550 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 97 | 20 | 4.908423 | 52.369501 | Own | Own | Own | Shared | Apartment | 650 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 98 | 60 | 4.800028 | 52.376855 | Shared | Shared | Shared | Shared | Apartment | 1750 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
99 rows × 28 columns
prepdata.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 99 entries, 0 to 98 Data columns (total 28 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 areaSqm 99 non-null int64 1 longitude 99 non-null float64 2 latitude 99 non-null float64 3 toilet 99 non-null object 4 shower 99 non-null object 5 kitchen 99 non-null object 6 living 99 non-null object 7 propertyType 99 non-null object 8 rent 99 non-null int64 9 rooms 99 non-null int32 10 owntoilet 99 non-null int32 11 sharedtoilet 99 non-null int32 12 notoilet 99 non-null int32 13 ownshower 99 non-null int32 14 sharedshower 99 non-null int32 15 noshower 99 non-null int32 16 ownkitchen 99 non-null int32 17 sharedkitchen 99 non-null int32 18 nokitchen 99 non-null int32 19 ownliving 99 non-null int32 20 sharedliving 99 non-null int32 21 noliving 99 non-null int32 22 kamer 99 non-null int32 23 appartement 99 non-null int32 24 room1 99 non-null uint8 25 room2 99 non-null uint8 26 room3 99 non-null uint8 27 rooms4 99 non-null uint8 dtypes: float64(2), int32(15), int64(2), object(5), uint8(4) memory usage: 16.0+ KB
We now apply the same one-hot encoding to the original dataset, again converting every categorical value into its own binary 0/1 column so that the regression algorithms receive only numerical values.
# here we map all toilet categories to columns containing only either 0 or 1
house['owntoilet'] = house['toilet'].map( {'Own': 1, 'Shared': 0, 'None': 0} ).astype(int)
house['sharedtoilet'] = house['toilet'].map( {'Own': 0, 'Shared': 1, 'None': 0} ).astype(int)
house['notoilet'] = house['toilet'].map( {'Own': 0, 'Shared': 0, 'None': 1} ).astype(int)
# here we map all shower categories to columns containing only either 0 or 1
house['ownshower'] = house['shower'].map( {'Own': 1, 'Shared': 0, 'None': 0} ).astype(int)
house['sharedshower'] = house['shower'].map( {'Own': 0, 'Shared': 1, 'None': 0} ).astype(int)
house['noshower'] = house['shower'].map( {'Own': 0, 'Shared': 0, 'None': 1} ).astype(int)
# here we map all kitchen categories to columns containing only either 0 or 1
house['ownkitchen'] = house['kitchen'].map( {'Own': 1, 'Shared': 0, 'None': 0} ).astype(int)
house['sharedkitchen'] = house['kitchen'].map( {'Own': 0, 'Shared': 1, 'None': 0} ).astype(int)
house['nokitchen'] = house['kitchen'].map( {'Own': 0, 'Shared': 0, 'None': 1} ).astype(int)
# here we map all living room categories to columns containing only either 0 or 1
house['ownliving'] = house['living'].map( {'Own': 1, 'Shared': 0, 'None': 0} ).astype(int)
house['sharedliving'] = house['living'].map( {'Own': 0, 'Shared': 1, 'None': 0} ).astype(int)
house['noliving'] = house['living'].map( {'Own': 0, 'Shared': 0, 'None': 1} ).astype(int)
# here we map all property type categories to columns containing only either 0 or 1
house['studio'] = house['propertyType'].map( {'Studio': 1, 'Room': 0, 'Apartment': 0, 'Anti-squat': 0, 'Student residence': 0} ).astype(int)
house['room'] = house['propertyType'].map( {'Studio': 0, 'Room': 1, 'Apartment': 0, 'Anti-squat': 0, 'Student residence': 0} ).astype(int)
house['appartement'] = house['propertyType'].map( {'Studio': 0, 'Room': 0, 'Apartment': 1, 'Anti-squat': 0, 'Student residence': 0} ).astype(int)
house['anti'] = house['propertyType'].map( {'Studio': 0, 'Room': 0, 'Apartment': 0, 'Anti-squat': 1, 'Student residence': 0} ).astype(int)
house['student res'] = house['propertyType'].map( {'Studio': 0, 'Room': 0, 'Apartment': 0, 'Anti-squat': 0, 'Student residence': 1} ).astype(int)
house
| areaSqm | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | owntoilet | sharedtoilet | notoilet | ownshower | sharedshower | noshower | ownkitchen | sharedkitchen | nokitchen | ownliving | sharedliving | noliving | studio | room | appartement | anti | student res | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 30 | 4.920721 | 52.370200 | Own | Own | Own | Own | Studio | 950 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 11 | 4.854786 | 52.350880 | Shared | Shared | Shared | Shared | Room | 1000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 11 | 60 | 4.879218 | 52.354884 | Own | Own | Own | Own | Apartment | 1590 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 17 | 19 | 4.976048 | 52.326211 | Shared | Shared | Shared | Shared | Room | 750 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 23 | 12 | 4.824007 | 52.352244 | Own | Own | Shared | Own | Room | 800 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46695 | 20 | 4.988951 | 52.318325 | Shared | Shared | Shared | Shared | Room | 1000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46698 | 10 | 4.877377 | 52.384534 | Shared | Shared | Shared | Shared | Room | 750 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46701 | 15 | 4.823299 | 52.379522 | Shared | Shared | Shared | Shared | Room | 750 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46710 | 10 | 4.985070 | 52.320498 | Shared | Shared | Shared | Shared | Room | 550 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46714 | 8 | 4.824709 | 52.382689 | Shared | Shared | Shared | None | Room | 700 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
8074 rows × 26 columns
So now we have one-hot encoded the data, so that every unique categorical value has its own 0/1 column, as can be seen above. In the next chapter we will preprocess these datasets and prepare them for modeling, where we will drop the categorical object columns and use the one-hot encoded ones.
from sklearn.model_selection import train_test_split
# in X we drop the target and the columns which are already one-hot encoded, including rooms
X = prepdata.drop(['rent', 'toilet',
'shower',
'kitchen',
'living',
'propertyType','rooms'], axis=1)
y = prepdata['rent']
# we split the data into training and testing: 30% of the data to test and 70% to train, with a random state of 42 to seed the internal random number generator and split the data randomly
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
Here we will preprocess the original dataset.
# in X we drop the target and the categorical columns which are already one-hot encoded
X_orig = house.drop(['rent', 'toilet',
'shower',
'kitchen',
'living',
'propertyType'], axis=1)
y_orig = house['rent']
# we split the data into training and testing: 30% of the data to test and 70% to train, with a random state of 0 to seed the internal random number generator and split the data randomly
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
X_orig, y_orig, test_size=0.3, random_state=0
)
Here we are going to start the modeling, using various regression algorithms to find the most suitable one. We use regression because we are predicting a quantity (how much). Since regression offers multiple algorithms, we will use a couple of them and compare them. Furthermore, we will run the models on both the merged dataset and the original dataset to compare, and finally we will also do the modeling on both datasets with and without hyperparameter tuning.
Here we start with the merged dataset, without any hyperparameter tuning.
Linear regression is an analysis used to forecast a value based on the values of other columns. It gives better results if there is a linear relationship between the variables.
from sklearn.linear_model import LinearRegression
#here we fit the model, predict, and find the slope and the intercept
slr = LinearRegression()
slr.fit(X_train, y_train)
#we predict the data and print the slope and intercept
y_train_pred = slr.predict(X_train)
y_test_pred = slr.predict(X_test)
print('Slope: %.3f' % slr.coef_[0])
print('Intercept: %.3f' % slr.intercept_)
Slope: 9.649 Intercept: 397329.162
As can be seen above, this model has a slope of 9.649 for the first feature (areaSqm); the positive slope tells us there is a linear relationship between x and y, whereas a slope of 0 would mean no linear relationship. With an intercept (b0) of 397329.162, the regression line for that feature is approximately y = 9.649x + 397329.162.
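Note that the printed slope is only the coefficient of the first feature. To see which coefficient belongs to which feature, we can pair them with the column names, as in this quick inspection sketch:
# pair each fitted coefficient with its feature name for easier reading
for name, coef in zip(X.columns, slr.coef_):
    print('%-15s %10.3f' % (name, coef))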
from sklearn.metrics import mean_squared_error
# we print the mean squared error
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
MSE train: 57211.446, test: 120919.946
The mean squared error is very high: 57211.446 for the training set and 120919.946 for the test set. The MSE measures the average squared distance between the residuals and the regression line. The training error being much lower than the test error suggests overfitting, which is not what we want.
from sklearn.metrics import r2_score
#compute the RMSE and the R squared for train and test
rmsedtl = (np.sqrt(mean_squared_error(y_test, y_test_pred)))
r2train=r2_score(y_train, y_train_pred)
r2test=r2_score(y_test, y_test_pred)
print('R^2 train: %.3f, test: %.3f' %
(r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
print("RMSE: ", rmsedtl)
R^2 train: 0.778, test: 0.578 RMSE: 347.73545413525034
As can be seen above, the R squared is 78% for training and 58% for testing; this is the proportion of the variance in the data the model can explain, and the gap tells us the model does not predict unseen data well. In addition, looking at the RMSE, which is the standard deviation of the residuals, it is 347.7, which tells us that if we used this model to predict the rent price it would be off by roughly 348 euros.
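As a sanity check on these metrics, R squared and RMSE can be computed by hand from their definitions; this sketch should reproduce the library values printed above.
# R^2 = 1 - SS_res / SS_tot, RMSE = sqrt(mean of squared residuals)
residuals = y_test - y_test_pred
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print('R^2 (manual): ', 1 - ss_res / ss_tot)
print('RMSE (manual):', np.sqrt(np.mean(residuals ** 2)))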
#Kfold validate our results
from sklearn.model_selection import KFold, cross_val_score,StratifiedKFold
# we set the splits to 5 iterations
#we set shuffle to true so that the data is shuffled before splitting
#we set the random state to 26, which controls the shuffling; 0, 30, 42, 50 and 56 were also tried and all gave lower scores
kfold=KFold(n_splits=5,random_state=26,shuffle=True)
#cross validate
res=cross_val_score(slr,X_train,y_train,cv=kfold)
print('cross val percentage:',res)
print('mean cross val percentage :', res.mean())
cross val percentage: [0.52774055 0.56343004 0.67833332 0.75293142 0.72413827] mean cross val percentage : 0.6493147172705903
It can be seen that this model scored 53% in the first fold, rising to 75% by the fourth fold before dipping to 72% in the fifth, with a mean of 65%.
A decision tree is a form of supervised learning used mostly in classification problems, but since we have a continuous target variable it can also be used for regression, as long as the target lies within the range of values in the training set. That is why it is a good idea to split the data randomly and set the random state parameter.
#load the decision tree library
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
#fit the model with random state 48, which performed best among the random states we tried
regressordt = DecisionTreeRegressor(random_state = 48)
regressordt.fit(X_train, y_train)
DecisionTreeRegressor(random_state=48)
# Predicting R squared for the Train set
ypreddt_train = regressordt.predict(X_train)
r2_scoredt_train = r2_score(y_train, ypreddt_train)
# Predicting R squared for the Test set
ypreddt_test = regressordt.predict(X_test)
r2_scoredt_test = r2_score(y_test, ypreddt_test)
# Predicting RMSE for the Test results
rmsedt = (np.sqrt(mean_squared_error(y_test, ypreddt_test)))
print('Number of tree nodes:',regressordt.fit(X_train,y_train).tree_.node_count)
print('R squared (train): ', r2_scoredt_train)
print('R squared (test): ', r2_scoredt_test)
print("RMSE: ", rmsedt)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, ypreddt_train),
mean_squared_error(y_test, ypreddt_test)))
Number of tree nodes: 83 R squared (train): 0.9908098760156716 R squared (test): 0.9269266755515975 RMSE: 144.7443467839968 MSE train: 2368.901, test: 20950.926
As can be seen, the R squared for the training set is very high at 99%, while the test set reaches about 93%, roughly 7 percentage points lower. This is still very good, as it shows the model can predict unseen data quite well. Furthermore, looking at the RMSE, our model would be off in its predictions by about 145 euros. Note, however, that the test MSE (20950) is higher than the training MSE (2369), which points to some remaining overfitting.
#Kfold validate our results
from sklearn.model_selection import KFold, cross_val_score
# we set the splits to 5 iterations
#we set shuffle to true so that the data is shuffled before splitting
#we set the random state to 52, which controls the shuffling; 0, 30, 42, 50 and 56 were also tried and all gave lower scores
kfold=KFold(n_splits=5,random_state=52,shuffle=True)
#cross validate
res=cross_val_score(regressordt,X_train,y_train,cv=kfold)
print('cross val percentage:',res)
print('mean cross val percentage :', res.mean())
cross val percentage: [0.91031673 0.90575889 0.29999093 0.74964139 0.85715425] mean cross val percentage : 0.744572441292583
It can be seen that this model scored 91% in the first fold, dropped to 30% in the third, and then recovered, with a mean of 74%. Now let's see the distribution between the original and predicted values.
#distribution plot to see the difference between the original values and the predicted values
sns.distplot(y_test-ypreddt_test)
C:\Users\ramya\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
<AxesSubplot:xlabel='rent', ylabel='Density'>
We see that the distribution is almost a bell-shaped curve with a slight skew to the right. We cannot conclude from this alone that our model works well; the skew might imply some overestimation.
#scatter plot of the predicted and actual values.
plt.scatter(y_test,ypreddt_test)
<matplotlib.collections.PathCollection at 0x1df47ca7040>
sns.regplot(y_test,ypreddt_test)
C:\Users\ramya\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
<AxesSubplot:xlabel='rent'>
We can see that there is a linear relationship between the variables, which indicates that the predicted values fall in the same range as the original ones.
Ridge regression is an algorithm used for multiple regression data; it is suitable when there is a large number of predictors relative to the number of observations.
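Concretely, ridge adds an L2 penalty, so larger alpha values shrink the coefficients toward zero. A small sketch illustrating the shrinkage on our training data (the alpha values are chosen arbitrarily for illustration):
from sklearn.linear_model import Ridge
import numpy as np
# larger alpha -> stronger L2 penalty -> smaller coefficients overall
for a in [0.01, 1, 10, 100]:
    r = Ridge(alpha=a).fit(X_train, y_train)
    print('alpha=%6.2f  mean |coef| = %.2f' % (a, np.abs(r.coef_).mean()))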
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
# we define a list of processing steps and use a pipeline to apply them sequentially
process = [
('scalar', StandardScaler()),
('model', Ridge(alpha=3, fit_intercept=True))
]
# now we build the pipeline and fit the data
pipe = Pipeline(process)
pipe.fit(X_train, y_train)
Pipeline(steps=[('scalar', StandardScaler()), ('model', Ridge(alpha=3))])
#predicting r squared for training
y_predridge_train = pipe.predict(X_train)
r2_ridge_train = r2_score(y_train, y_predridge_train)
# Predicting R squared for the Test
y_predridge_test = pipe.predict(X_test)
r2_ridge_test = r2_score(y_test, y_predridge_test)
# Predicting RMSE the Test
rmse_ridge = (np.sqrt(mean_squared_error(y_test, y_predridge_test)))
print('R squared (train): ', r2_ridge_train)
print('R squared (test): ', r2_ridge_test)
print("RMSE: ", rmse_ridge)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_predridge_train),
mean_squared_error(y_test, y_predridge_test)))
R squared (train): 0.7740397526996772 R squared (test): 0.5961526779182944 RMSE: 340.27557884950824 MSE train: 58244.856, test: 115787.470
As can be seen above, the R squared for testing is about 18 percentage points lower than for training, and the RMSE indicates this model is off in its predictions by about 340 euros. This is not a very good model for our dataset; after hyperparameter tuning in the next chapter, it might give better results.
Random forest regression is a supervised learning algorithm that tends to achieve higher cross-validation accuracy, handles missing data well and, by averaging many trees, reduces overfitting.
from sklearn.ensemble import RandomForestRegressor
#random forest with 500 estimators and a random state of 42
rf = RandomForestRegressor(n_estimators = 500, random_state = 42)
# we use ravel to return a 1D array
rf.fit(X_train, y_train.ravel())
RandomForestRegressor(n_estimators=500, random_state=42)
# Predicting R squared of the Train
y_predrf_train = rf.predict(X_train)
r2_rf_train = r2_score(y_train, y_predrf_train)
# Predicting R2 squared of the Test
y_predrf_test = rf.predict(X_test)
r2_rf_test = r2_score(y_test, y_predrf_test)
# Predicting RMSE the Test
rmse_rf = (np.sqrt(mean_squared_error(y_test, y_predrf_test)))
print('R squared (train): ', r2_rf_train)
print('R squared (test): ', r2_rf_test)
print("RMSE: ", rmse_rf)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_predrf_train),
mean_squared_error(y_test, y_predrf_test)))
R squared (train): 0.9671112939767649 R squared (test): 0.8673914510927428 RMSE: 194.98802441624102 MSE train: 8477.588, test: 38020.330
As can be seen above, using the random forest model the R squared of the training set is very high at about 97%, and the test set reaches about 87%. Furthermore, the RMSE indicates that this model is off by about 195 euros. Let's first cross-validate this using k-fold.
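Before cross-validating, note that a fitted random forest also exposes feature_importances_, which shows which features drive the predictions. This inspection step is a sketch added here, not part of the original pipeline:
# rank features by their impurity-based importance in the fitted forest
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))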
#Kfold validate our results
from sklearn.model_selection import KFold, cross_val_score
# we set the splits to 5 iterations
#we set shuffle to true so that the data is shuffled before splitting
#we set the random state to 38, which controls the shuffling; 0, 39, 41, 50, 56 and 70 were also tried and all gave lower scores
kfold=KFold(n_splits=5,random_state=38,shuffle=True)
#cross validate
res=cross_val_score(rf,X_train,y_train,cv=kfold)
print('cross val percentage:',res)
print('mean cross val percentage :', res.mean())
cross val percentage: [0.80942495 0.91301479 0.90378739 0.51732452 0.84888365] mean cross val percentage : 0.7984870598682752
It can be seen that the cross-validation scores fluctuate across the folds, with a mean of about 80%; in each iteration we train on four folds and test on the remaining one. Now let's see how the residuals are distributed.
#distribution plot to see the difference between the original values and the predicted values
sns.distplot(y_test-y_predrf_test)
C:\Users\ramya\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
<AxesSubplot:xlabel='rent', ylabel='Density'>
It can be seen that it almost forms a bell-shaped curve, but it is slightly skewed to the right. This might imply some overestimation of the price.
#scatter plot of the predicted and actual values.
plt.scatter(y_test,y_predrf_test)
<matplotlib.collections.PathCollection at 0x1df4aaa9820>
We can see that there is a linear relationship between the variables, which indicates that the predicted values fall in the same range as the original ones.
Now we repeat the same models on the original dataset, starting with linear regression, which forecasts a value based on the values of other columns and works best when there is a linear relationship between the variables.
#here we fit the model, predict, and find the slope and the intercept
slr_orig = LinearRegression()
slr_orig.fit(X_train_orig, y_train_orig)
#we predict the data and print the slope and intercept
y_train_pred_orig = slr_orig.predict(X_train_orig)
y_test_pred_orig = slr_orig.predict(X_test_orig)
print('Slope: %.3f' % slr_orig.coef_[0])
print('Intercept: %.3f' % slr_orig.intercept_)
Slope: 9.926 Intercept: -25009.332
As can be seen above, this model has a slope of 9.926 for the first feature, which is positive and indicates a linear relationship between x and y; the intercept (b0) is -25009.332. Based on this, the regression line for that feature is approximately y = 9.926x - 25009.332.
# we print the mean squared error
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train_orig, y_train_pred_orig),
mean_squared_error(y_test_orig, y_test_pred_orig)))
MSE train: 71900.957, test: 68496.723
The mean squared error is still high: 71900.957 for the training set and 68496.723 for the test set. The MSE measures the average squared distance between the residuals and the regression line. Here the test error is slightly lower than the training error, which suggests the model generalizes well, which is what we are interested in.
#compute the RMSE and the R squared for train and test
rmsedtl_orig = (np.sqrt(mean_squared_error(y_test_orig, y_test_pred_orig)))
r2train_orig=r2_score(y_train_orig, y_train_pred_orig)
r2test_orig=r2_score(y_test_orig, y_test_pred_orig)
print('R^2 train: %.3f, test: %.3f' %
(r2_score(y_train_orig, y_train_pred_orig),
r2_score(y_test_orig, y_test_pred_orig)))
print("RMSE: ", rmsedtl_orig)
R^2 train: 0.726, test: 0.743 RMSE: 261.7187862779158
As can be seen above, the R squared is 73% for training and 74% for testing; this is the proportion of the variance the model can explain, and here the model generalizes reasonably to unseen data. Looking at the RMSE, which is the standard deviation of the residuals, it is 261.7, meaning that if we used this model to predict the rent price it would be off by roughly 262 euros.
#Kfold validate our results
from sklearn.model_selection import KFold, cross_val_score,StratifiedKFold
# we set the splits to 5 iterations
#we set shuffle to true so that the data is shuffled before splitting
#we set the random state to 26, which controls the shuffling; 0, 30, 42, 50 and 56 were also tried and all gave lower scores
kfold=KFold(n_splits=5,random_state=26,shuffle=True)
#cross validate
res=cross_val_score(slr_orig,X_train_orig,y_train_orig,cv=kfold)
print('cross val percentage:',res)
print('mean cross val percentage :', res.mean())
cross val percentage: [0.70115825 0.73228623 0.73031439 0.74578027 0.71015976] mean cross val percentage : 0.7239397804508583
It can be seen that this model scored 70% in the first fold, rising to 75% in the fourth before dipping to 71% in the fifth, with a mean of 72%.
We again fit a decision tree; as before, it can be used for regression since our target is continuous and lies within the range of values in the training set, which is why we split the data randomly and set the random state.
#fit the model with random state 48, which performed best among the random states we tried
regressordt_orig = DecisionTreeRegressor(random_state = 48)
regressordt_orig.fit(X_train_orig, y_train_orig)
# Predicting R squared for the Train set
ypreddt_train_orig = regressordt_orig.predict(X_train_orig)
r2_scoredt_train_orig = r2_score(y_train_orig, ypreddt_train_orig)
# Predicting R squared for the Test set
ypreddt_test_orig = regressordt_orig.predict(X_test_orig)
r2_scoredt_test_orig = r2_score(y_test_orig, ypreddt_test_orig)
# Predicting RMSE for the Test results
rmsedt_orig = (np.sqrt(mean_squared_error(y_test_orig, ypreddt_test_orig)))
print('Number of tree nodes:',regressordt_orig.fit(X_train_orig,y_train_orig).tree_.node_count)
print('R squared (train): ', r2_scoredt_train_orig)
print('R squared (test): ', r2_scoredt_test_orig)
print("RMSE: ", rmsedt_orig)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train_orig, ypreddt_train_orig),
mean_squared_error(y_test_orig, ypreddt_test_orig)))
Number of tree nodes: 7443 R squared (train): 0.9932080238173373 R squared (test): 0.7207405935920032 RMSE: 272.8991421534985 MSE train: 1785.333, test: 74473.942
As can be seen, the R squared for the training set is very high at 99%, but the test set does poorly with an R squared of 72%, about 27 percentage points lower. This is somewhat bad, as it shows the model does not predict unseen data well. Furthermore, looking at the RMSE, our model would be off in its predictions by about 273 euros. In addition, the test MSE is much higher than the training MSE, which indicates this model can be improved.
# we set the splits to 5 iterations
#we set shuffle to true so that the data is shuffled before splitting
#we set the random state to 52, which controls the shuffling; 0, 30, 42, 50 and 56 were also tried and all gave lower scores
kfold=KFold(n_splits=5,random_state=52,shuffle=True)
#cross validate
res=cross_val_score(regressordt_orig,X_train_orig,y_train_orig,cv=kfold)
print('cross val percentage:',res)
print('mean cross val percentage :', res.mean())
cross val percentage: [0.5817619 0.70149844 0.67760543 0.56880916 0.73300274] mean cross val percentage : 0.6525355340211102
It can be seen that this model scored 58% in the first fold, rose to 70% in the second, dipped again, and ended at 73%, with a mean of 65%. Now let's see the distribution between the original and predicted values.
#distribution plot to see the difference between the original values and the predicted values
sns.distplot(y_test_orig-ypreddt_test_orig)
C:\Users\ramya\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
<AxesSubplot:xlabel='rent', ylabel='Density'>
We can see that it forms a nearly perfect bell-shaped curve, which indicates the model is somewhat accurate.
#scatter plot of the predicted and actual values.
plt.scatter(y_test_orig,ypreddt_test_orig)
<matplotlib.collections.PathCollection at 0x1df4c6eff70>
We can see that there is a linear relationship between the variables, which indicates that the predicted values fall in the same range as the original ones.
We again apply ridge regression, which, as noted above, is suitable when there is a large number of predictors relative to the number of observations.
# we define a list of processing steps and use a pipeline to apply them sequentially
process = [
('scalar', StandardScaler()),
('model', Ridge(alpha=3, fit_intercept=True))
]
# now we build the pipeline and fit the data
pipe_orig = Pipeline(process)
pipe_orig.fit(X_train_orig, y_train_orig)
#predicting r squared for training
y_predridge_train_orig = pipe_orig.predict(X_train_orig)
r2_ridge_train_orig = r2_score(y_train_orig, y_predridge_train_orig)
# Predicting R squared for the Test
y_predridge_test_orig = pipe_orig.predict(X_test_orig)
r2_ridge_test_orig = r2_score(y_test_orig, y_predridge_test_orig)
# Predicting RMSE the Test
rmse_ridge_orig = (np.sqrt(mean_squared_error(y_test_orig, y_predridge_test_orig)))
print('R squared (train): ', r2_ridge_train_orig)
print('R squared (test): ', r2_ridge_test_orig)
print("RMSE: ", rmse_ridge_orig)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train_orig, y_predridge_train_orig),
mean_squared_error(y_test_orig, y_predridge_test_orig)))
R squared (train): 0.7264656371862015 R squared (test): 0.7431182451917456 RMSE: 261.7368641568589 MSE train: 71900.998, test: 68506.186
As can be seen above, the R squared for testing is about 2 percentage points higher than for training. The RMSE indicates this model is off in its predictions by about 262 euros. This is still not a very good model for our dataset; after hyperparameter tuning, it might give better results in the next chapter.
Finally we fit the random forest again; as noted above, averaging many trees tends to give higher cross-validation accuracy and reduces overfitting.
#random forest with 500 estimators and a random state of 42
rf_orig = RandomForestRegressor(n_estimators = 500, random_state = 42)
# we use ravel to return a 1D array
rf_orig.fit(X_train_orig, y_train_orig.ravel())
# Predicting R squared of the Train
y_predrf_train_orig = rf_orig.predict(X_train_orig)
r2_rf_train_orig = r2_score(y_train_orig, y_predrf_train_orig)
# Predicting R2 squared of the Test
y_predrf_test_orig = rf_orig.predict(X_test_orig)
r2_rf_test_orig = r2_score(y_test_orig, y_predrf_test_orig)
# Predicting RMSE the Test
rmse_rf_orig = (np.sqrt(mean_squared_error(y_test_orig, y_predrf_test_orig)))
print('R squared (train): ', r2_rf_train_orig)
print('R squared (test): ', r2_rf_test_orig)
print("RMSE: ", rmse_rf_orig)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train_orig, y_predrf_train_orig),
mean_squared_error(y_test_orig, y_predrf_test_orig)))
R squared (train): 0.9695866062648018 R squared (test): 0.8254699308609713 RMSE: 215.7413517760107 MSE train: 7994.437, test: 46544.331
As can be seen above, using the random forest model the R squared of the training set is very high at about 97%, and the test set reaches about 83%. Furthermore, the RMSE indicates that this model is off by about 216 euros. Let's first cross-validate this using k-fold.
#Kfold validate our results
from sklearn.model_selection import KFold, cross_val_score
# we set the splits to 5 iterations
#we set shuffle to true so that the data is shuffled before splitting
#we set the random state to 38, which controls the shuffling; 0, 39, 41, 50, 56 and 70 were also tried and all gave lower scores
kfold=KFold(n_splits=5,random_state=38,shuffle=True)
#cross validate
res=cross_val_score(rf_orig,X_train_orig,y_train_orig,cv=kfold)
print('cross val percentage:',res)
print('mean cross val percentage :', res.mean())
cross val percentage: [0.82476874 0.82028785 0.74351358 0.8086919 0.81502344] mean cross val percentage : 0.80245710281546
It can be seen that the cross-validation scores are fairly stable across the folds, with a mean of 80%; in each iteration we train on four folds and test on the remaining one. Now let's see how the residuals are distributed.
#distribution plot to see the difference between the original values and the predicted values
sns.distplot(y_test_orig-y_predrf_test_orig)
C:\Users\ramya\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
<AxesSubplot:xlabel='rent', ylabel='Density'>
It can be seen that it forms a bell-shaped curve with the median at zero.
#scatter plot of the predicted and actual values.
plt.scatter(y_test_orig,y_predrf_test_orig)
<matplotlib.collections.PathCollection at 0x1df4c7a0a30>
We can see that there is a linear relationship between the variables, which indicates that the predicted values fall in the same range as the original ones.
For linear regression there were not many parameters worth tuning, so this part stays empty and we keep the defaults.
Now we tune the decision tree. As before, it can be used for regression since our target is continuous and falls within the training range, which is why we split the data randomly and set the random state.
Now we will use grid search to find the best parameter values for this model.
# hyperparameter ranges that we want to search over
#splitter chooses how each node is split: "best" or "random"
# max_depth limits the depth of the tree; if None it grows until all leaves are pure, which could lead to overfitting
#min_samples_leaf is the minimum number of samples required at a leaf node
# min_weight_fraction_leaf is the minimum weighted fraction of the input samples required at a leaf node
#max_features is the number of features we consider when splitting
#max_leaf_nodes caps the number of leaf nodes; if not specified the tree can grow an unlimited number of leaves
parameters={"splitter":["best","random"],
"max_depth" : [1,3,5,7,9,11,12,None],
"min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
"min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
"max_features":["auto","log2","sqrt",None],
"max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90] }
Running the code below could take 15-20 minutes to search the parameter grid.
# grid search for the best hyperparameters (commented out because of the runtime)
#from sklearn.model_selection import GridSearchCV
#tuningmodel=GridSearchCV(regressordt,param_grid=parameters,scoring='neg_mean_squared_error',cv=3,verbose=3)
#tuningmodel.fit(X_train,y_train)
# print the best hyperparameters
#tuningmodel.best_params_
The output of the previous code was:
{'max_depth': 5,
'max_features': 'auto',
'max_leaf_nodes': None,
'min_samples_leaf': 1,
'min_weight_fraction_leaf': 0.1,
'splitter': 'best'}
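The grid above contains 2 x 8 x 10 x 9 x 4 x 10 = 57,600 candidate combinations, which explains the runtime. If that is too slow, a randomized search over the same ranges samples a fixed number of candidates instead; a sketch, assuming regressordt, X_train and y_train as above:
from sklearn.model_selection import RandomizedSearchCV
# tries 100 random combinations instead of all 57,600
random_search = RandomizedSearchCV(regressordt, param_distributions=parameters,
                                   n_iter=100, scoring='neg_mean_squared_error',
                                   cv=3, random_state=0, verbose=1)
random_search.fit(X_train, y_train)
print(random_search.best_params_)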
Now let us fit and predict the target variable using the tuned hyperparameters.
#fitting with the new tuned model
tunedmodel = DecisionTreeRegressor(max_depth=5, max_features='auto', max_leaf_nodes=None,
                                   min_samples_leaf=1, min_weight_fraction_leaf=0.1,
                                   splitter='best', random_state=48)
tunedmodel.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=5, max_features='auto',
min_weight_fraction_leaf=0.1, random_state=48)
# Predicting R squared for the Train set
ypreddt_train_tuned = tunedmodel.predict(X_train)
r2_scoredt_train_tuned = r2_score(y_train, ypreddt_train_tuned)
# Predicting R squared for the Test set
ypreddt_test_tuned = tunedmodel.predict(X_test)
r2_scoredt_test_tuned = r2_score(y_test, ypreddt_test_tuned)
# Predicting RMSE for the Test results
rmsedt_tuned = (np.sqrt(mean_squared_error(y_test, ypreddt_test_tuned)))
from sklearn.model_selection import KFold, cross_val_score
kfold=KFold(n_splits=5,random_state=0,shuffle=True)
#cross validate
res=cross_val_score(tunedmodel,X_train,y_train,cv=kfold)
print(res)
print()
print('Number of tree nodes:',tunedmodel.fit(X_train,y_train).tree_.node_count)
print('R squared (train): ', r2_scoredt_train_tuned)
print('R squared (test): ', r2_scoredt_test_tuned)
print("RMSE: ", rmsedt_tuned)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, ypreddt_train_tuned),
mean_squared_error(y_test, ypreddt_test_tuned)))
[0.58415485 0.62187213 0.91163626 0.84213899 0.69324245]

Number of tree nodes: 15
R squared (train):  0.8698176858867098
R squared (test):  0.759230486280573
RMSE:  262.73802170224747
MSE train: 33556.567, test: 69031.268
Comparing the tuned and untuned models, the untuned one scored better: the tuned model has a test R squared of 76% and a test MSE of 69,031, whereas the untuned model has a test R squared of 92% and a test MSE of 20,950, which is closer to zero. Could the untuned model's higher score be due to overfitting?
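One way to check the overfitting suspicion is a validation curve: the train and cross-validation scores as a function of a single hyperparameter such as max_depth. A sketch, assuming X_train and y_train as above:
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor
depths = [1, 3, 5, 7, 9, 11, 15, 20]
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=48), X_train, y_train,
    param_name='max_depth', param_range=depths, cv=5)
# a widening gap between the two curves as depth grows signals overfitting
plt.plot(depths, train_scores.mean(axis=1), label='train R squared')
plt.plot(depths, val_scores.mean(axis=1), label='cv R squared')
plt.xlabel('max_depth')
plt.legend()
plt.show()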
Ridge regression is an algorithm for multiple regression data; it is suited to data with a large number of predictors relative to the number of observations.
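The alpha parameter searched below controls how strongly the coefficients are shrunk towards zero; a quick illustration of the effect, assuming X_train and y_train as above:
import numpy as np
from sklearn.linear_model import Ridge
# larger alpha -> heavier L2 penalty -> smaller coefficients overall
for alpha in [0.001, 1, 100]:
    r = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, np.abs(r.coef_).sum())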
#store ridge function
ridge=Ridge()
#identify the alpha values (regularization strengths) we want to try
para={'alpha':[1e-15,1e-10,1e-08,0.01,0.001,1,5,10,20,30,40,50,100]}
#grid search the best parameters of the alpha with 6 as number of folds
ridge_regressor=GridSearchCV(ridge,para,scoring='r2',cv=6)
#fit the data
ridge_regressor.fit( X_train, y_train)
#start the prediction of xtrain
ypredtrain_tuned=ridge_regressor.predict(X_train)
#store the r squared of the training data
r2_train_tuned=r2_score(y_train,ypredtrain_tuned)
#predict the x test
ypredtest_tuned=ridge_regressor.predict(X_test)
#store the r squared of the test data
r2_test_tuned=r2_score(y_test,ypredtest_tuned)
#root mean squared error, the typical size of the prediction error
rmse_tuned=(np.sqrt(mean_squared_error(y_test, ypredtest_tuned)))
#now we print all the info needed
print('R squared (train): ', r2_train_tuned)
print('R squared (test): ', r2_test_tuned)
print("RMSE: ", rmse_tuned)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, ypredtrain_tuned),
mean_squared_error(y_test, ypredtest_tuned)))
R squared (train):  0.7272767810472829
R squared (test):  0.6372683578000646
RMSE:  322.48899495453435
MSE train: 70298.757, test: 103999.152
Comparing the tuned and untuned models, the tuned one is clearly more accurate: the test R squared is 63% against 59% for the untuned model. In addition, this model is typically off by about 322 euros, less than the untuned model's 340.
Random forest regression is a supervised learning algorithm that tends to give a higher cross-validation accuracy and, because it averages many trees, is much less prone to overfitting than a single tree. Now we will use grid search to find the best parameter tuning for this model.
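Because each tree is trained on a bootstrap sample, the rows a tree never saw give a free generalization estimate, the out-of-bag (OOB) score, without a separate validation split. A sketch, assuming X_train and y_train as above:
from sklearn.ensemble import RandomForestRegressor
# oob_score=True scores each row using only the trees that never saw it
rf_oob = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print('OOB R squared:', rf_oob.oob_score_)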
#in here we define the parameter ranges used to search for the best hyperparameters for our model
#n_estimators is the number of trees in the forest
#max_features is the number of features considered for the best split
#max_depth is the maximum depth of each tree; if set to None it keeps expanding until all leaves are pure, which can cause overfitting
#criterion is the function that measures the quality of a split (squared error or absolute error)
#min_samples_split is the minimum number of samples required to split an internal node
#min_samples_leaf is the minimum number of samples required at a leaf node
#bootstrap decides whether bootstrap samples (random subsets drawn with replacement) are used instead of the whole dataset for each tree
parameters_grid = {
'n_estimators': [200, 500,100,600,300,800],
'max_depth' : [4,5,6,7,20,None],
'min_samples_split':[2,4,8,10],
'min_samples_leaf':[2,4,8,1,10]
}
# here we start the search, setting the number of folds to 5 with negative MSE as the scoring metric
#tuningmodel=GridSearchCV(rf,param_grid=parameters_grid,scoring='neg_mean_squared_error',cv=5,verbose=3)
#tuningmodel.fit(X_train,y_train)
# print the best hyperparameters
#tuningmodel.best_params_
Running the code above would take around 15 minutes, but the best parameters from the grid search were:
{'max_depth': 20,
'min_samples_leaf': 2,
'min_samples_split': 2,
'n_estimators': 100}
#fitting with the new tuned model
tunedmodelrf = RandomForestRegressor(max_depth=20, n_estimators=100, min_samples_leaf=2,
                                     min_samples_split=2, random_state=42)
tunedmodelrf.fit(X_train, y_train)
RandomForestRegressor(max_depth=20, min_samples_leaf=2, random_state=42)
# Predicting R squared for the Train set
ypreddt_train_tuned = tunedmodelrf.predict(X_train)
r2_score_train_tuned = r2_score(y_train, ypreddt_train_tuned)
# Predicting R squared for the Test set
ypreddt_test_tuned = tunedmodelrf.predict(X_test)
r2_score_test_tuned = r2_score(y_test, ypreddt_test_tuned)
# Predicting RMSE for the Test results
rmsedf_tuned = (np.sqrt(mean_squared_error(y_test, ypreddt_test_tuned)))
from sklearn.model_selection import KFold, cross_val_score
kfold=KFold(n_splits=5,random_state=48,shuffle=True)
#cross validate
res=cross_val_score(tunedmodelrf,X_train,y_train,cv=kfold)
print(res)
print()
print('R squared (train): ', r2_score_train_tuned)
print('R squared (test): ', r2_score_test_tuned)
print("RMSE: ", rmsedf_tuned)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, ypreddt_train_tuned),
mean_squared_error(y_test, ypreddt_test_tuned)))
[0.91554827 0.082684   0.85466281 0.61829782 0.86186245]

R squared (train):  0.9370409588553212
R squared (test):  0.8610931148202292
RMSE:  199.56485651731685
MSE train: 16228.697, test: 39826.132
This model scored 86% on testing with an RMSE of 199. Compared to the untuned model, which had an RMSE of 194 and likewise a test score of about 86%, it actually did slightly worse on RMSE. Is this because the parameter search was not set up correctly?
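One possible source of the discrepancy is that the model above was re-created by hand with a random_state the search never used; reusing the fitted search object avoids that. A sketch, assuming the tuningmodel grid search above was actually run:
# best_estimator_ is already refit on the full training set with the
# winning parameters, so nothing needs to be re-typed by hand
best_rf = tuningmodel.best_estimator_
print('test R squared:', best_rf.score(X_test, y_test))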
As before, a decision tree can be used for regression here, as long as the target stays within the range of values seen in the training data; we again split the data randomly and set the random_state parameter.
Now we will use grid search to find the best parameter tuning for this model.
# Hyperparameter ranges we want to search over
#splitter decides whether a node is split on the best split or the best random split
#max_depth limits the depth of the tree; if None it runs until all nodes are expanded, which can lead to overfitting
#min_samples_leaf is the minimum number of samples required at a leaf node
#min_weight_fraction_leaf is the minimum fraction of the input samples required at each leaf node
#max_features is the number of features considered while splitting
#max_leaf_nodes caps the number of leaf nodes; if not specified the number of leaves is unbounded
parameters={"splitter":["best","random"],
"max_depth" : [1,3,5,7,9,11,12,None],
"min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
"min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
"max_features":["auto","log2","sqrt",None],
"max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90] }
Running the code below could take 15-20 minutes to search the parameter grid.
# grid search for the best hyperparameters (commented out because of the runtime)
#from sklearn.model_selection import GridSearchCV
#tuningmodel_orig=GridSearchCV(regressordt,param_grid=parameters,scoring='neg_mean_squared_error',cv=3,verbose=3)
#tuningmodel_orig.fit(X_train_orig,y_train_orig)
# print the best hyperparameters
#tuningmodel_orig.best_params_
The output of the previous code was:
{'max_depth': 5,
'max_features': 'auto',
'max_leaf_nodes': None,
'min_samples_leaf': 1,
'min_weight_fraction_leaf': 0.1,
'splitter': 'best'}
Now let us fit and predict the target variable using the tuned hyperparameters.
#fitting with the new tuned model
tunedmodel_orig = DecisionTreeRegressor(max_depth=5, max_features='auto', max_leaf_nodes=None,
                                        min_samples_leaf=1, min_weight_fraction_leaf=0.1,
                                        splitter='best', random_state=48)
tunedmodel_orig.fit(X_train_orig, y_train_orig)
DecisionTreeRegressor(max_depth=5, max_features='auto',
min_weight_fraction_leaf=0.1, random_state=48)
# Predicting R squared for the Train set
ypreddt_train_tuned_orig = tunedmodel_orig.predict(X_train_orig)
r2_scoredt_train_tuned_orig = r2_score(y_train_orig, ypreddt_train_tuned_orig)
# Predicting R squared for the Test set
ypreddt_test_tuned_orig = tunedmodel_orig.predict(X_test_orig)
r2_scoredt_test_tuned_orig = r2_score(y_test_orig, ypreddt_test_tuned_orig)
# Predicting RMSE for the Test results
rmsedt_tuned_orig = (np.sqrt(mean_squared_error(y_test_orig, ypreddt_test_tuned_orig)))
from sklearn.model_selection import KFold, cross_val_score
kfold=KFold(n_splits=5,random_state=0,shuffle=True)
#cross validate
res=cross_val_score(tunedmodel_orig,X_train_orig,y_train_orig,cv=kfold)
print(res)
print()
print('Number of tree nodes:',tunedmodel_orig.fit(X_train_orig,y_train_orig).tree_.node_count)
print('R squared (train): ', r2_scoredt_train_tuned_orig)
print('R squared (test): ', r2_scoredt_test_tuned_orig)
print("RMSE: ", rmsedt_tuned_orig)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train_orig, ypreddt_train_tuned_orig),
mean_squared_error(y_test_orig, ypreddt_test_tuned_orig)))
[0.64421257 0.68352826 0.61513781 0.68895003 0.64494811]

Number of tree nodes: 15
R squared (train):  0.6585061710403834
R squared (test):  0.6731432987134378
RMSE:  295.24120004736545
MSE train: 89764.762, test: 87167.366
Comparing the tuned and untuned models, the untuned one scored better: the tuned model has a test R squared of 67% and a test MSE of 87,167, whereas the untuned model has a test R squared of 72% and a test MSE of 74,474, which is closer to zero. Could the untuned model's higher score be due to overfitting?
As above, ridge regression is suitable here because of the large number of predictors relative to the number of observations.
#store ridge function
ridge_tuned_orig=Ridge()
#identify the alpha values (regularization strengths) we want to try
para={'alpha':[1e-15,1e-10,1e-08,0.01,0.001,1,5,10,20,30,40,50,100]}
#grid search the best parameters of the alpha with 6 as number of folds
ridge_regressor_tuned_orig=GridSearchCV(ridge_tuned_orig,para,scoring='r2',cv=6)
#fit the data
ridge_regressor_tuned_orig.fit( X_train_orig, y_train_orig)
#start the prediction of xtrain
ypredtrain_tuned_orig=ridge_regressor_tuned_orig.predict(X_train_orig)
#store the r squared of the training data
r2_train_tuned_orig=r2_score(y_train_orig,ypredtrain_tuned_orig)
#predict the x test
ypredtest_tuned_orig=ridge_regressor_tuned_orig.predict(X_test_orig)
#store the r squared of the test data
r2_test_tuned_orig=r2_score(y_test_orig,ypredtest_tuned_orig)
#root mean squared error, the typical size of the prediction error
rmse_r_orig=(np.sqrt(mean_squared_error(y_test_orig, ypredtest_tuned_orig)))
#now we print all the info needed
print('R squared (train): ', r2_train_tuned_orig)
print('R squared (test): ', r2_test_tuned_orig)
print("RMSE: ", rmse_r_orig)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train_orig, ypredtrain_tuned_orig),
mean_squared_error(y_test_orig, ypredtest_tuned_orig)))
(during the search sklearn repeatedly emitted LinAlgWarning: "Ill-conditioned matrix (rcond on the order of 2e-17): result may not be accurate", six times with slightly different rcond values)
R squared (train):  0.7263892221164848
R squared (test):  0.7434203544222362
RMSE:  261.58290930492143
MSE train: 71921.084, test: 68425.618
Comparing the tuned and untuned models, the tuned one is slightly more accurate, although the values are almost identical: the test R squared is 74.34% against 74.31% for the untuned model. The tuned model is typically off by about 261.5 euros, marginally less than the untuned model's 261.7.
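The LinAlgWarning printed during the search hints at near-collinear, unscaled features; putting a StandardScaler in front of Ridge usually improves the conditioning of the solve. A sketch, assuming X_train_orig, y_train_orig and the test split as above:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
# scaling the features first stabilizes the linear solve that triggered the warning
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_train_orig, y_train_orig)
print('R squared (test):', scaled_ridge.score(X_test_orig, y_test_orig))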
As before, random forest regression tends to give a higher cross-validation accuracy and is less prone to overfitting than a single tree. Now we will use grid search to find the best parameter tuning for this model.
#in here we define the parameter ranges used to search for the best hyperparameters for our model
#n_estimators is the number of trees in the forest
#max_features is the number of features considered for the best split
#max_depth is the maximum depth of each tree; if set to None it keeps expanding until all leaves are pure, which can cause overfitting
#criterion is the function that measures the quality of a split (squared error or absolute error)
#min_samples_split is the minimum number of samples required to split an internal node
#min_samples_leaf is the minimum number of samples required at a leaf node
#bootstrap decides whether bootstrap samples (random subsets drawn with replacement) are used instead of the whole dataset for each tree
#parameters_grid = {
#'n_estimators': [200, 500, 600],
#'max_features': ['auto', 'sqrt', 'log2'],
#'max_depth' : [5,6,7,20],
#'min_samples_split':[2,4],
#'min_samples_leaf':[2,4,1],
#'bootstrap':[True, False]
#}
# here we start the search, setting the number of folds to 5 with negative MSE as the scoring metric
#tuningmodel_orig=GridSearchCV(rf_orig,param_grid=parameters_grid,scoring='neg_mean_squared_error',cv=5,verbose=3)
#tuningmodel_orig.fit(X_train_orig,y_train_orig)
# print the best hyperparameters
#tuningmodel_orig.best_params_
Running the code above would take about 30 minutes. The grid search found these best parameters:
{'bootstrap': True,
'max_depth': 20,
'max_features': 'sqrt',
'min_samples_leaf': 1,
'min_samples_split': 2,
'n_estimators': 600}
#fitting with the new tuned model
tunedmodelrf_orig = RandomForestRegressor(max_depth=20, max_features='sqrt', n_estimators=600,
                                          min_samples_leaf=1, min_samples_split=2,
                                          random_state=42)
tunedmodelrf_orig.fit(X_train_orig, y_train_orig)
RandomForestRegressor(max_depth=20, max_features='sqrt', n_estimators=600,
random_state=42)
# Predicting R squared for the Train set
ypreddt_train_orig_tuned = tunedmodelrf_orig.predict(X_train_orig)
r2_score_train_orig_tuned = r2_score(y_train_orig, ypreddt_train_orig_tuned)
# Predicting R squared for the Test set
ypreddt_test_orig_tuned = tunedmodelrf_orig.predict(X_test_orig)
r2_score_test_orig_tuned = r2_score(y_test_orig, ypreddt_test_orig_tuned)
# Predicting RMSE for the Test results
rmsed_rf_orig_tuned = (np.sqrt(mean_squared_error(y_test_orig, ypreddt_test_orig_tuned)))
from sklearn.model_selection import KFold, cross_val_score
kfold=KFold(n_splits=5,random_state=48,shuffle=True)
#cross validate
res=cross_val_score(tunedmodelrf_orig,X_train_orig,y_train_orig,cv=kfold)
print(res)
print()
print('R squared (train): ', r2_score_train_orig_tuned)
print('R squared (test): ', r2_score_test_orig_tuned)
print("RMSE: ", rmsed_rf_orig_tuned)
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train_orig, ypreddt_train_orig_tuned),
mean_squared_error(y_test_orig, ypreddt_test_orig_tuned)))
[0.79548906 0.80309767 0.80927496 0.82788935 0.78720803]

R squared (train):  0.9616860043376942
R squared (test):  0.8305491479987904
RMSE:  212.5788942118734
MSE train: 10071.183, test: 45189.786
As can be seen above, training did better than testing: training has an R squared of 96% while testing has 83%. The RMSE of about 212 means the model's predictions are typically off by that amount in euros. Finally, the test MSE is noticeably higher than the training MSE, which implies some overfitting.
sns.regplot(x=y_test_orig, y=ypreddt_test_orig_tuned)
[Figure: regression plot of actual vs. predicted rent]
Comparing the tuned and untuned random forest on the original dataset, the two are very close: the untuned model had a test score of 82.5% with an RMSE of 215.7, against 83.1% and an RMSE of 212.6 for the tuned model, so tuning only brings a marginal improvement here.
# this is the table of the original dataset; it shows the format the prediction inputs below must follow
X_orig
| areaSqm | longitude | latitude | owntoilet | sharedtoilet | notoilet | ownshower | sharedshower | noshower | ownkitchen | sharedkitchen | nokitchen | ownliving | sharedliving | noliving | studio | room | appartement | anti | student res | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 30 | 4.920721 | 52.370200 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 11 | 4.854786 | 52.350880 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 11 | 60 | 4.879218 | 52.354884 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 17 | 19 | 4.976048 | 52.326211 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 23 | 12 | 4.824007 | 52.352244 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46695 | 20 | 4.988951 | 52.318325 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46698 | 10 | 4.877377 | 52.384534 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46701 | 15 | 4.823299 | 52.379522 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46710 | 10 | 4.985070 | 52.320498 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46714 | 8 | 4.824709 | 52.382689 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
8074 rows × 20 columns
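The prediction cells below pass plain Python lists, which makes sklearn warn that X has no valid feature names (the models were fitted on a DataFrame). Wrapping the row in a DataFrame with the same columns as X_orig keeps the models quiet; a sketch:
import pandas as pd
# one listing in the same column order as X_orig, numeric rather than strings
row = pd.DataFrame([[45, 4.963605, 52.374108, 1, 0, 0, 1, 0, 0, 1,
                     0, 0, 1, 0, 0, 0, 1, 0, 0, 0]],
                   columns=X_orig.columns)
print(rf_orig.predict(row))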
#checking a listing that follows the table format and predicting it
featurez = [[45, 4.963605, 52.374108, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
slr_orig.predict(featurez)
array([888.76316558])
#checking a listing that follows the table format and predicting it
featurez = [[45, 4.963605, 52.374108, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
regressordt_orig.predict(featurez)
array([950.])
#checking a listing that follows the table format and predicting it
featurez = [[45, 4.963605, 52.374108, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
pipe_orig.predict(featurez)
array([888.61230504])
#checking a listing that follows the table format and predicting it
featurez = [[45, 4.963605, 52.374108, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
rf_orig.predict(featurez)
array([1213.44996032])
# this is the table of the merged dataset; it shows the format the prediction inputs below must follow
X
| areaSqm | longitude | latitude | owntoilet | sharedtoilet | notoilet | ownshower | sharedshower | noshower | ownkitchen | sharedkitchen | nokitchen | ownliving | sharedliving | noliving | kamer | appartement | room1 | room2 | room3 | rooms4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120 | 4.963605 | 52.374108 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 100 | 4.963605 | 52.374108 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 120 | 4.963605 | 52.374108 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 120 | 4.963605 | 52.374108 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 44 | 4.963605 | 52.374108 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 80 | 4.896287 | 52.356954 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 95 | 98 | 4.943846 | 52.394173 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 96 | 48 | 4.875135 | 52.368200 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 97 | 20 | 4.908423 | 52.369501 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 98 | 60 | 4.800028 | 52.376855 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
99 rows × 21 columns
#checking a listing that follows the table format and predicting it
features = [[6, 4.908423, 52.369501, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
slr.predict(features)
array([310.9494051])
#checking a listing that follows the table format and predicting it
features = [[6, 4.908423, 52.369501, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
tunedmodel.predict(features)
array([765.])
#checking a listing that follows the table format and predicting it
features = [[6, 4.908423, 52.369501, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
ridge_regressor.predict(features)
array([655.11825662])
#checking a listing that follows the table format and predicting it
features = [[6, 4.908423, 52.369501, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
tunedmodelrf.predict(features)
array([827.49511905])
# this is the table of the original dataset and it is used to follow the format in the prediction
X_orig
(output identical to the X_orig table shown above)
#checking a listing that follows the table format and predicting it
featurez = [[45, 4.963605, 52.374108, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
slr_orig.predict(featurez)
array([888.76316558])
featurez = [[45, 4.963605, 52.374108, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
tunedmodel_orig.predict(featurez)
array([1416.36939571])
#checking a listing that follows the table format and predicting it
featurez = [[45, 4.963605, 52.374108, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
ridge_regressor_tuned_orig.predict(featurez)
array([887.93102586])
#checking a listing that follows the table format and predicting it
featurez = [[45, 4.963605, 52.374108, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
tunedmodelrf_orig.predict(featurez)
array([1206.38802778])
# this is the table of the merged dataset and it is used to follow the format in the prediction
X
(output identical to the X table shown above)
#checking a listing that follows the table format and predicting it
features = [[6, 4.908423, 52.369501, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
slr.predict(features)
array([310.9494051])
#checking a listing that follows the table format and predicting it
features = [[6, 4.908423, 52.369501, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
regressordt.predict(features)
array([950.])
#checking a listing that follows the table format and predicting it
features = [[6, 4.908423, 52.369501, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
pipe.predict(features)
array([336.29393027])
#checking a listing that follows the table format and predicting it
features = [[6, 4.908423, 52.369501, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0]]
# in here we predict for this listing with the chosen regression model
rf.predict(features)
array([825.04])
# we are creating a list of model results to compare the models we made
models = [('Linear Regression', rmsedtl, r2train, r2test),
('Ridge Regression', rmse_ridge, r2_ridge_train, r2_ridge_test),
('Decision Tree Regression', rmsedt, r2_scoredt_train, r2_scoredt_test),
('Random Forest Regression', rmse_rf, r2_rf_train, r2_rf_test)
]
#creating a dataframe from the list and specifying the columns
predicted = pd.DataFrame(data = models, columns=['Modeltype', 'RMSE', 'R-squared(training)', 'R-squared(test)'])
predicted
| Modeltype | RMSE | R-squared(training) | R-squared(test) | |
|---|---|---|---|---|
| 0 | Linear Regression | 347.735454 | 0.778049 | 0.578251 |
| 1 | Ridge Regression | 340.275579 | 0.774040 | 0.596153 |
| 2 | Decision Tree Regression | 144.744347 | 0.990810 | 0.926927 |
| 3 | Random Forest Regression | 194.988024 | 0.967111 | 0.867391 |
#here we bar plot the models for easy comparison
f, axes = plt.subplots(2,1, figsize=(12,8))
#in here we sort the values on r squared in ascending order
predicted.sort_values(by=['R-squared(training)'], ascending=True, inplace=True)
#in here we plot the bar plot based on the data
sns.barplot(x='R-squared(training)', y='Modeltype', data = predicted, palette='Purples_d', ax = axes[0])
axes[0].set_xlabel('R-squared(Training)', size=12)
axes[0].set_ylabel('Modeltype')
axes[0].set_xlim(0,1.0)
#in the 2nd subplot we plot the r squared of testing and compare
predicted.sort_values(by=['R-squared(test)'], ascending=True, inplace=True)
sns.barplot(x='R-squared(test)', y='Modeltype', data = predicted, palette='Oranges_d', ax = axes[1])
axes[1].set_xlabel('R-squared (Test)', size=12)
axes[1].set_ylabel('Modeltype')
axes[1].set_xlim(0,1.0)
plt.show()
Comparing the models trained on the merged dataset with default parameters, the decision tree and the random forest both score above 80% on test R squared and close to 100% on training. This is interesting because the random forest was expected to do better, since it guards against overfitting; the decision tree's test error being higher than its training error implies the tree is overfit: it fits the training points very well but does worse on new points.
Below we will explore this further and compare.
# we are creating a list of model results to compare the tuned models
models = [('Linear Regression', rmsedtl, r2train, r2test),
('Ridge Regression', rmse_tuned, r2_train_tuned, r2_test_tuned),
('Decision Tree Regression', rmsedt_tuned, r2_scoredt_train_tuned, r2_scoredt_test_tuned),
('Random Forest Regression', rmsedf_tuned, r2_score_train_tuned, r2_score_test_tuned)
]
#creating a dataframe from the list and specifying the columns
predicted1 = pd.DataFrame(data = models, columns=['Modeltype', 'RMSE', 'R-squared(training)', 'R-squared(test)'])
predicted1
| Modeltype | RMSE | R-squared(training) | R-squared(test) | |
|---|---|---|---|---|
| 0 | Linear Regression | 347.735454 | 0.778049 | 0.578251 |
| 1 | Ridge Regression | 322.488995 | 0.727277 | 0.637268 |
| 2 | Decision Tree Regression | 262.738022 | 0.869818 | 0.759230 |
| 3 | Random Forest Regression | 199.564857 | 0.937041 | 0.861093 |
#here we bar plot the models for easy comparison
f, axes = plt.subplots(2,1, figsize=(12,8))
#in here we sort the values on r squared in ascending order
predicted1.sort_values(by=['R-squared(training)'], ascending=True, inplace=True)
#in here we plot the bar plot based on the data
sns.barplot(x='R-squared(training)', y='Modeltype', data = predicted1, palette='Purples_d', ax = axes[0])
axes[0].set_xlabel('R-squared(Training)', size=12)
axes[0].set_ylabel('Modeltype')
axes[0].set_xlim(0,1.0)
#in the 2nd subplot we plot the r squared of testing and compare
predicted1.sort_values(by=['R-squared(test)'], ascending=True, inplace=True)
sns.barplot(x='R-squared(test)', y='Modeltype', data = predicted1, palette='Oranges_d', ax = axes[1])
axes[1].set_xlabel('R-squared (Test)', size=12)
axes[1].set_ylabel('Modeltype')
axes[1].set_xlim(0,1.0)
plt.show()
This is very interesting: when tuned, the random forest did the best, better than the decision tree, with a test score above 80%. That is down to how a random forest works: averaging many trees smooths out noise in the data and limits overfitting, hence a higher accuracy than the single decision tree, which is easily prone to overfitting. Meanwhile, comparing ridge and linear regression, ridge performs better on testing while linear performs better on training; with this many features the plain linear model overfits, so ridge is recommended. Ridge and lasso are regularization algorithms, which help avoid overfitting.
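Since lasso is mentioned as the other regularization option: unlike ridge, its L1 penalty drives some coefficients exactly to zero, which doubles as feature selection. A sketch, assuming X_train and y_train as above:
import numpy as np
from sklearn.linear_model import Lasso
# with a moderate penalty, lasso zeroes out the least useful features
lasso = Lasso(alpha=1.0, max_iter=10000).fit(X_train, y_train)
print('non-zero coefficients:', np.sum(lasso.coef_ != 0), 'of', len(lasso.coef_))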
Now if we compare the untuned and the tuned models, the untuned models appear to do better. That could be because the searched parameters are worse than the defaults, or because tuning made the models more balanced and less overfitted, which lowers the inflated untuned scores.
# we are creating a list of model results to compare the tuned models on the original dataset
models = [('Linear Regression', rmsedtl_orig, r2train_orig, r2test_orig),
('Ridge Regression', rmse_r_orig, r2_train_tuned_orig, r2_test_tuned_orig),
('Decision Tree Regression', rmsedt_tuned_orig, r2_scoredt_train_tuned_orig, r2_scoredt_test_tuned_orig),
('Random Forest Regression', rmsed_rf_orig_tuned, r2_score_train_orig_tuned, r2_score_test_orig_tuned)
]
#creating a dataframe from the list and specifying the columns
predicted2 = pd.DataFrame(data = models, columns=['Modeltype', 'RMSE', 'R-squared(training)', 'R-squared(test)'])
predicted2
| Modeltype | RMSE | R-squared(training) | R-squared(test) | |
|---|---|---|---|---|
| 0 | Linear Regression | 261.718786 | 0.726466 | 0.743154 |
| 1 | Ridge Regression | 261.582909 | 0.726389 | 0.743420 |
| 2 | Decision Tree Regression | 295.241200 | 0.658506 | 0.673143 |
| 3 | Random Forest Regression | 212.578894 | 0.961686 | 0.830549 |
#here we bar plot the models for easy comparison
f, axes = plt.subplots(2,1, figsize=(12,8))
#in here we sort the values on r squared in ascending order
predicted2.sort_values(by=['R-squared(training)'], ascending=True, inplace=True)
#in here we plot the bar plot based on the data
sns.barplot(x='R-squared(training)', y='Modeltype', data = predicted2, palette='Purples_d', ax = axes[0])
axes[0].set_xlabel('R-squared(Training)', size=12)
axes[0].set_ylabel('Modeltype')
axes[0].set_xlim(0,1.0)
#in the 2nd subplot we plot the r squared of testing and compare
predicted2.sort_values(by=['R-squared(test)'], ascending=True, inplace=True)
sns.barplot(x='R-squared(test)', y='Modeltype', data = predicted2, palette='Oranges_d', ax = axes[1])
axes[1].set_xlabel('R-squared (Test)', size=12)
axes[1].set_ylabel('Modeltype')
axes[1].set_xlim(0,1.0)
plt.show()
Comparing the tuned models on the original dataset, in training the random forest does best, followed by linear and ridge regression (almost tied) and then the decision tree. On the testing side, the random forest has an R squared of 83% compared to about 74% for ridge.
When tuned, the random forest again did the best, better than the decision tree, with a test score above 80%; its averaging over many trees smooths out noise in the data and limits overfitting, whereas a single decision tree is easily prone to it. Comparing ridge and linear regression on this dataset, the two perform almost identically, with ridge's regularization providing only a marginal difference.
# we are creating a list of model results to compare the untuned models on the original dataset
models = [('Linear Regression', rmsedtl_orig, r2train_orig, r2test_orig),
('Ridge Regression', rmse_ridge_orig, r2_ridge_train_orig, r2_ridge_test_orig),
('Decision Tree Regression', rmsedt_orig, r2_scoredt_train_orig, r2_scoredt_test_orig),
('Random Forest Regression', rmse_rf_orig, r2_rf_train_orig, r2_rf_test_orig)
]
#creating a dataframe from the list and specifying the columns
predicted3 = pd.DataFrame(data = models, columns=['Modeltype', 'RMSE', 'R-squared(training)', 'R-squared(test)'])
predicted3
| Modeltype | RMSE | R-squared(training) | R-squared(test) | |
|---|---|---|---|---|
| 0 | Linear Regression | 261.718786 | 0.726466 | 0.743154 |
| 1 | Ridge Regression | 261.736864 | 0.726466 | 0.743118 |
| 2 | Decision Tree Regression | 272.899142 | 0.993208 | 0.720741 |
| 3 | Random Forest Regression | 215.741352 | 0.969587 | 0.825470 |
#here we bar plot the models for easy comparison
f, axes = plt.subplots(2,1, figsize=(12,8))
#in here we sort the values on r squared in ascending order
predicted3.sort_values(by=['R-squared(training)'], ascending=True, inplace=True)
#in here we plot the bar plot based on the data
sns.barplot(x='R-squared(training)', y='Modeltype', data = predicted3, palette='Purples_d', ax = axes[0])
axes[0].set_xlabel('R-squared(Training)', size=12)
axes[0].set_ylabel('Modeltype')
axes[0].set_xlim(0,1.0)
#in the 2nd subplot we plot the r squared of testing and compare
predicted3.sort_values(by=['R-squared(test)'], ascending=True, inplace=True)
sns.barplot(x='R-squared(test)', y='Modeltype', data = predicted3, palette='Oranges_d', ax = axes[1])
axes[1].set_xlabel('R-squared (Test)', size=12)
axes[1].set_ylabel('Modeltype')
axes[1].set_xlim(0,1.0)
plt.show()
What is interesting here is that the untuned models appear highly accurate. Comparing the decision tree and the random forest, the decision tree scores almost 100% on training but only around 72% on testing, because the algorithm has overfitted. The random forest, on the other hand, scores around 97% on training and around 82% on testing, thanks to its averaging nature, which keeps the model from overfitting as badly.
What can be seen is that including more data raises accuracy. The untuned original dataset's highest training score was 99% with a highest test score of 82%, measured in R squared, while the untuned enhanced dataset reached 99% in training and 92% in testing. Comparing the tuned runs, the original dataset peaked at 96% training and 83% testing, against 93% training and 86% testing for the enhanced dataset. This again shows that more data means more accuracy.

As for why tuning the parameters matters: the high accuracy before tuning is partly an artifact of overfitting, which shows up in the mean squared errors. Before tuning, the test MSE is far higher than the training MSE, implying the model is overfitted; after tuning, the training MSE moves into the range of the test MSE. Take the decision tree on the original dataset: before tuning, the training MSE was 1,785 against a test MSE of 74,473, a clear sign of overfitting, while after tuning the training MSE is 89,764 and the test MSE is 87,167. Therefore, hyperparameter tuning and enriching the data are both important factors in achieving reliable, high-accuracy results.
According to the interviews conducted with the target audience and the domain expert, this project can reduce the time needed to search for reliable houses; it can predict what a house is worth, with a price margin (i.e., the RMSE); it gives newly arrived students a baseline for price negotiation; and the margin between the predicted and actual price can be used to flag houses as possible scams.
Below is the table that was fed into the models; a predicted-price column will be added to it. The table will be saved as a CSV and used in a Power BI application. Please check the phase 4 document.
# table that was fed to the models
house
| areaSqm | longitude | latitude | toilet | shower | kitchen | living | propertyType | rent | owntoilet | sharedtoilet | notoilet | ownshower | sharedshower | noshower | ownkitchen | sharedkitchen | nokitchen | ownliving | sharedliving | noliving | studio | room | appartement | anti | student res | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 30 | 4.920721 | 52.370200 | Own | Own | Own | Own | Studio | 950 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 11 | 4.854786 | 52.350880 | Shared | Shared | Shared | Shared | Room | 1000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 11 | 60 | 4.879218 | 52.354884 | Own | Own | Own | Own | Apartment | 1590 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 17 | 19 | 4.976048 | 52.326211 | Shared | Shared | Shared | Shared | Room | 750 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 23 | 12 | 4.824007 | 52.352244 | Own | Own | Shared | Own | Room | 800 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46695 | 20 | 4.988951 | 52.318325 | Shared | Shared | Shared | Shared | Room | 1000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46698 | 10 | 4.877377 | 52.384534 | Shared | Shared | Shared | Shared | Room | 750 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46701 | 15 | 4.823299 | 52.379522 | Shared | Shared | Shared | Shared | Room | 750 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46710 | 10 | 4.985070 | 52.320498 | Shared | Shared | Shared | Shared | Room | 550 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 46714 | 8 | 4.824709 | 52.382689 | Shared | Shared | Shared | None | Room | 700 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
8074 rows × 26 columns
# In here we insert the predicted value of the best model, which was the random forest.
# note: DataFrame.insert works in place and returns None, so its result must not be assigned back to house
#house.insert(9, "pred", rf_orig.predict(X_orig))
#save as a CSV.
#house.to_csv('house.csv')
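Finally, a minimal sketch of the scam-flagging idea described above: listings whose asking rent deviates from the prediction by more than the RMSE margin get flagged. This assumes house, X_orig, rf_orig and rmse_rf_orig as defined earlier; the column names pred and flag are illustrative.
# predicted rent plus a simple margin-based scam flag for each listing
house['pred'] = rf_orig.predict(X_orig)
# flag listings whose rent deviates from the prediction by more than the RMSE margin
house['flag'] = (house['rent'] - house['pred']).abs() > rmse_rf_orig
house.to_csv('house.csv')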